ABSTRACT

The idea of pseudo-classification based on external relevance 
is introduced and compared with the more usual classifications derived 
by associative techniques. A general model for an information retrieval 
system using term classification is described. The derivation of a 
set of operators, or perturbations, for adjusting pseudo-classifications
and preventing their deterioration is given for a particular match 
function conforming with this model. The use of pseudo-classifications 
both for the prediction of relevant documents and for the evaluation of 
retrieval systems with respect to their theoretical optimum is discussed. 
The concept of the improvability of a retrieval model with respect to 
its constituent submodels is introduced and elaborated upon. 
SECTION I

1. INTRODUCTION

1.1 The Use of Relevance in
Evaluative Retrieval System

In recent years a number of experiments have been performed to
examine the application of term classifications to information retrieval.
It seems evident from the work of Salton (1), and of Sparck Jones and
Jackson (2) that a certain measure of improvement in performance¹ over
simple term retrieval may be obtained by using classifications generated
automatically. Both groups have established independently that small 
tightly structured classes of terms constitute a classification favourable 
to retrieval applications, but it is as yet unknown whether the improve- 
ment gained by using such classifications may be increased still further. 
The classes used in these experiments have in the main been generated 
automatically by making use of the co-occurrence of terms in documents. 
Doyle (3) has discussed the question of using co-occurrence as a measure 
of similarity and has pointed out a number of difficulties which this 
raises. It is, however, not at all easy to see how his criticisms can be 
met in practice without the danger of introducing further difficulties. 
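
As an illustration of what generating classes from co-occurrence involves, the following minimal sketch counts, for every pair of terms, the number of documents in which both occur. The function name and the toy documents are assumptions made for this example only; the experiments cited above use their own similarity coefficients and clustering procedures, which are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered pair of terms, the documents containing both."""
    counts = Counter()
    for terms in documents:
        # Sorting gives every pair a canonical (alphabetical) order.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical document descriptions, each a set of index terms.
docs = [
    {"alloy", "steel", "fatigue"},
    {"alloy", "steel"},
    {"fatigue", "crack"},
]
pairs = cooccurrence_counts(docs)
print(pairs[("alloy", "steel")])  # number of documents containing both terms
```

Pairs with high counts would be candidates for membership of a common class under the association approach; any thresholding or normalization is left open here.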

¹Performance throughout this paper relates only to the retrieval
or non-retrieval of relevant or non-relevant documents. No account is
taken of hardware factors. Thus, for example, the amount of effort
expended in extracting relevant documents is not taken into consideration.
In the experiments of Sparck Jones and Jackson (4), attention has
been focused specifically on the effect of automatically generated classes
on retrieval performance. Their studies have involved research both in
classification theory and information retrieval theory, and have been
directed towards finding classification and retrieval algorithms which
result in improvements in performance beyond that obtained by simple
keyword coordination. An assessment of the effect of two different
classifications of the same document collection and the same set of
requests on retrieval performance involves a comparison between the two
recall-precision curves. This was achieved by 'rule of thumb' since the
authors were interested principally in an overall improvement for all
coordination levels. Thus, when the curves for two experiments crossed,
improvement was regarded as uncertain. Salton and Lesk (5) have used
statistical methods to establish whether a limited number of
classifications and thesauri display consistent improvement over a number
of different document collections. Another evaluation measure has been
developed by Swets (6), who uses a decision theoretic approach.
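
The comparison of recall-precision curves over coordination levels can be made concrete with a short sketch. It assumes that a document is retrieved at a given coordination level when it shares at least that many terms with the request; the function names and the toy data are illustrative and are not taken from the cited experiments.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one set of retrieved documents."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def curve_by_coordination_level(request, documents, relevant):
    """One (level, recall, precision) point per coordination level.

    documents maps a document identifier to its set of index terms.
    """
    points = []
    for level in range(1, len(request) + 1):
        retrieved = {
            doc_id for doc_id, terms in documents.items()
            if len(terms & request) >= level
        }
        points.append((level, *recall_precision(retrieved, relevant)))
    return points


# Hypothetical request, collection and relevance judgments.
request = {"a", "b", "c"}
documents = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a"}, 4: {"d"}}
relevant = {1, 2}
points = curve_by_coordination_level(request, documents, relevant)
```

Two classifications would each yield such a list of points, and the 'rule of thumb' comparison amounts to asking whether one list dominates the other at every level.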

The approaches outlined above are similar in a number of respects.
First, both use the term descriptions of the documents in the collection
to produce term co-occurrences, and thence a classification of terms.
Secondly, both attempt to evaluate the performance of the retrieval
procedure by using the base set of requests and a table of documents
relevant to each request. Thirdly, neither attempts to adjust a
classification on the basis of information gathered during evaluation. This
is not intended as a criticism of these approaches, for both set out to
establish whether classifications derived solely from co-occurrences of terms
within documents can improve performance. A subsequent adjustment of the
classification would, therefore, be outside the terms of reference of
these investigations. Finally, neither can give a clear indication of
the best performance which theoretically may be obtained from their
retrieval systems given the document collection, the base set of requests
and the table of documents relevant to the base set of requests. Some
work in this general direction, however, has been done by Cleverdon and
Keen (7), who have examined the exhaustivity and specificity of several
index languages. Their arguments are based on the assumption that the
user who approaches the document collection will, in all probability, not
be familiar with the keywords which are available for specifying his
request. Indeed, in a system which is designed to serve the user, there is
no reason why he should be. They have put into clear relief the problem
of ascertaining the actual intention of the user when he formulates his
request, and have called in question the meaning and interpretation of
relevance. For example, are documents judged relevant on the basis of
the actual request the user formulated or are they judged on the basis
of the request he should have formulated, had he complete knowledge of
the term vocabulary and a clearer idea of his own request?

These remarks make it clear that there are four sources of
information for an evaluative retrieval system, namely, the document
collection, the base set of requests, the term classification, and the
table of documents relevant to the base set of requests. The latter,
together with a performance measure, enables the system to evaluate its
performance by comparing lists of documents actually retrieved with lists
of documents it should have retrieved. Not all of the four sources of
information are independent of each other. Under the hypothesis that:
1.1.1 'co-occurrence of terms within documents is a suitable measure of
the similarity between terms'

a classification may be generated automatically. This hypothesis will be
referred to as the Association Hypothesis. Alternatively, the
classification may be produced manually. The dependence between the
document collection, the base set of requests, the term classification
and the table of relevant documents is, however, less tractable. Suppose,
for example, that a match function, which measures the coefficient of
matching between a document and a request, is designed, and that this
coefficient varies with the coordination level of a match both on terms
and on classes. Then with the hypothesis, which will be called the
Relevance Hypothesis, that:



1.1.2 'coordination is positively correlated with external relevance'
(i.e., that relevance may be defined algorithmically) we could imagine a 
classification and a match function which retrieved all and only re- 
levant documents, namely those with high match coefficients. In practice, 
however, this position is seldom attained; partly because the match
function is an imperfect approximation to that function which actually
corresponds to external relevance, if indeed such exists; partly because
the classification is defective; partly because there may, in fact, not
be sufficient information provided to the system to achieve the best
performance²; and partly because the notion of relevance may not be well
formed and may result in inconsistencies of some sort. The difficulty
is that we are attempting to simulate the external judging of relevance
by using an assortment of classification algorithms based on the
association hypothesis and an assortment of match functions based on the
relevance hypothesis. Moreover, the simulation is undirected since we
have no guide as to how a better simulation may be achieved. It may be
that the lack of success in the field is attributable to the fact that
the approach to retrieval we are using is overdetermined. We have
attempted to inject more information into the system by introducing
hypotheses rather than by utilizing the information available in a more
economical fashion. It is the more economical utilization of the
available information, and the ideas to which this leads, that this
paper is to explore.

²Best performance is used here in the sense of complete agreement
with the document-request relevance table.
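
A match function whose coefficient varies with the coordination level on both terms and classes might be sketched as follows. The additive form used here (shared terms plus classes that contain a term of the request and a term of the document) is an assumption made for illustration; the text above does not fix a formula at this point.

```python
def match_coefficient(request, document, classes):
    """Assumed match coefficient: term coordination plus class coordination."""
    # Coordination level on terms: terms common to request and document.
    term_level = len(request & document)
    # Coordination level on classes: classes that link the request to the
    # document through at least one member term on each side.
    class_level = sum(
        1 for members in classes if members & request and members & document
    )
    return term_level + class_level


# Hypothetical data: one class treating two terms as intersubstitutable.
classes = [{"car", "automobile"}]
request = {"car", "engine"}
document = {"automobile", "engine"}

with_classes = match_coefficient(request, document, classes)
without_classes = match_coefficient(request, document, [])
```

In this toy case the class raises the coefficient of a document that shares only one literal term with the request, which is the effect a term classification is intended to have in retrieval.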

1.2 The Use of Relevance in
Pseudo-Classifications

The relationships which hold between the categories of information 
used in an evaluative retrieval system were described briefly above. To 
make descriptions of this type more precise it is convenient to introduce
the notion of model. Suppose that a set of experiments is designed to
examine a particular aspect of information retrieval. Functions and proc- 
esses are designed according to a set of rules and these, when fused to- 
gether, form the retrieval system. The set of rules which govern the 
construction of each part of the system represent a description of a 
particular approach to the solution of information retrieval. These rules 
themselves may be, in effect, hypotheses within the field of information 
retrieval and an evaluative retrieval system allows these hypotheses to 
be tested. A complete knowledge of the set of rules gives, ideally,
complete knowledge of a particular approach to information retrieval. The
set of rules is called a model. The set of rules may be such that a
number of quite different systems can be constructed to conform to them 
and each such system is called a representation of the model. Within a 
model of information retrieval there will be models, or submodels, for
each of the logically distinct processes which together comprise the
system. Thus, for example, there will be a classification model if
classification is required as an adjunct of the retrieval system. There 
will be an information model describing the categories of information 
required by the system, together with statements of the assumptions made 
about these categories. There will be a relevance model containing 
statements about the way in which external relevance is simulated by the 
system. Finally, there may be an evaluation model containing statements 
of the assumptions made about the evaluation of the output of the re- 
trieval system. Once the model has been defined it may be found that 
experiments can be designed to test the validity of the complete model, 
that is all representations within the model, rather than the validity 
of a single representation. 

Section 1.1 contained an informal specification of part of a 
particular model of information retrieval in its reference to the
relevance and association hypotheses (1.1.1 and 1.1.2) and the
interrelation of the four categories of information listed in that section.

In this model, the classification and match function are used together 
with the document descriptions to simulate the external judgments of re- 
levance for the base set of requests. The information contained in the 
relevance table is disregarded in the construction of the system and is 
used post facto, evaluatively, to corroborate or contradict the result 
of the classification algorithm or the match function. Instead, suppose
that the requests are well-formed, in that they genuinely represent the
requests which the user intended to submit to the system, and that the
table of relevance appropriate to those requests is incontrovertible.
The match function may be defective but is assumed to be at least a
distant approximation to our intuitive notion of relevance. Each request
is taken in turn and a classification of terms is gradually constructed
which will confer a high match coefficient on those documents
which are judged relevant. Each constructive step consists of an
operation altering the membership of selected terms in selected classes of
the classification. Furthermore, it is ensured that a subsequent
alteration of the classification to accommodate the relevance judgments of
another request does not affect the classification in a way which would
be detrimental to the requests already examined. The classification so
constructed deserves the name only insofar as it consists, de facto, of
classes of terms. There may be little reason to assume that such a
classification would consist of classes of terms which would represent
generic concepts associated with the document collection. Subsequent
analysis, for example, by comparison with classifications generated from
term co-occurrences, might demonstrate that this is so. In the absence
of this, however, we choose to draw a distinction between these two
classifications by referring to the former as a pseudo-classification,
and it is to be regarded, for the moment at least, as a purely formal
construction. Provided such a classification may be constructed, we are
in a strong position. By construction, performance for the base set of
requests will be the best attainable for all representations of the
classification model, given the particular model of retrieval. A
subsequent request, one which does not belong to the base set, will be
examined by the system and documents will be retrieved provided the match
coefficient is sufficiently high. Of course, there is no absolute guarantee
that the documents retrieved in this way will be judged relevant but the
same criticism may be leveled at any other system. This point is
elaborated in the next section.
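
The constructive procedure just described can be sketched in a much simplified form. The sketch below is not the algorithm developed later in the paper: it assumes an additive match coefficient, a fixed retrieval threshold, and a single kind of operation (creating a new class from one request term and one document term), and it accepts an operation only if the current request improves and no previously examined request deteriorates. All names and data are hypothetical.

```python
def match(request, document, classes):
    """Assumed match coefficient: shared terms plus classes linking the two."""
    shared_terms = len(request & document)
    shared_classes = sum(1 for c in classes if c & request and c & document)
    return shared_terms + shared_classes


def errors(request, relevant, documents, classes, threshold):
    """Count documents whose retrieval status disagrees with the relevance table."""
    return sum(
        (match(request, terms, classes) >= threshold) != (doc_id in relevant)
        for doc_id, terms in documents.items()
    )


def build_pseudo_classification(requests, relevance, documents, threshold):
    """Process requests in turn, adding classes that never harm earlier requests."""
    classes, processed = [], []
    for req_id, request in requests.items():
        examined = processed + [req_id]
        for doc_id in sorted(relevance[req_id]):
            terms = documents[doc_id]
            for a in sorted(request - terms):
                for b in sorted(terms - request):
                    if match(request, terms, classes) >= threshold:
                        break  # this relevant document is already retrieved
                    trial = classes + [{a, b}]
                    old = [errors(requests[r], relevance[r], documents, classes, threshold)
                           for r in examined]
                    new = [errors(requests[r], relevance[r], documents, trial, threshold)
                           for r in examined]
                    # Accept only if the current request improves and no
                    # examined request becomes worse.
                    if new[-1] < old[-1] and all(n <= o for n, o in zip(new, old)):
                        classes = trial
        processed.append(req_id)
    return classes


# Hypothetical collection, base set of requests and relevance table.
documents = {1: {"car", "engine"}, 2: {"automobile", "engine"}, 3: {"boat", "engine"}}
requests = {"q1": {"car", "engine"}, "q2": {"ship", "engine"}}
relevance = {"q1": {1, 2}, "q2": {3}}

classes = build_pseudo_classification(requests, relevance, documents, threshold=2)
remaining = [errors(requests[r], relevance[r], documents, classes, 2) for r in requests]
```

The classes produced in this toy run pair terms only because doing so reproduces the relevance table, which is the sense in which the result is a pseudo-classification rather than a grouping of similar terms. The sketch never removes a term from a class, whereas the discussion that follows also allows removal.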















SECTION II 

2. THE USE OF PSEUDO-CLASSIFICATIONS IN RETRIEVAL

2.1 The Predictive Use of Pseudo- 
Classifications 

The procedure which has been outlined constructs
pseudo-classifications, which produce the best performance when used in
conjunction with the apparatus of information retrieval. A distinction has
been drawn between a pseudo-classification and a classification, for the 
former is a pure construction, a deus ex machina , while the latter is, 
as far as current classification research permits, a more fundamental 
grouping based on the resemblances between terms. The problem arises 
of predicting the effect of a pseudo-classification on retrieval per- 
formance when a new set of requests, distinct from the set used in con- 
structing the pseudo-classification, is offered to the system. There 
is formally no guarantee that the system will respond in anything but a 
perverse way, for it must be clearly remembered that the pseudo- 
classification is derived from a particular set of requests and a par- 
ticular set of relevance judgments over the whole document collection. 

We hope that the procedure has somehow generalized the notion of re- 
levance from a number of specific instances of relevance of documents to 
requests and has enclosed it in the pseudo-classification in such a way 
that the system may predict which documents bear the same relationship
to subsequently presented requests as certain prescribed documents bear
to the base set of requests. It is as if we have tried to distil
relevance itself from repeated analogical statements such as "the
relationship between D1 and R1 is the same as the relationship between D2
and R2" where it is known that document D1 is relevant to request R1 and
document D2 is relevant to request R2. Although it is hoped that
generalization has taken place, this cannot be assumed as the following
hypothetical set of requests and relevance judgments demonstrates.
Suppose that each request is disjoint from all other requests, each
request has one document relevant to it and no document is relevant to
more than one request of the base set of requests. A possible
pseudo-classification for this configuration would be represented by the
assignment of each term to a class by itself, and such a configuration
obviously would not extend to an arbitrary request. Although experimental
demonstration is required to show that this is unlikely, there are a
number of arguments which might constitute a plausible defence against
this eventuality. The first point is that the match function is
designed to reflect an intuitive notion of relevance as formulated in the
relevance hypothesis (1.1.2). The classification is constructed by
operations involving the assignment of terms to existing classes and the
creation of new classes by the assignment to them of terms not yet
gathered together into a class. Pairs of terms within a class are used
in retrieval as if they are intersubstitutable and it is the
intersubstitutable pairs which yield good retrieval performance which we
hope to locate. Although there may be some choice available in the
selection of which terms and classes to operate upon and although these
terms may be brought together into such classes randomly in the event of
no other
basis for decision, it is not unreasonable to suppose that, as the
classification develops, it will become more determinate in that the
opportunity for random assignments will diminish since the number of
assignments to be made is finite. Nor is it unreasonable to suppose
that the later assignments of terms to classes may have the effect of
diminishing the possibly deleterious effect of the random assignments
made during the early stages of the construction of the classification
providing we allow terms to be removed from classes. In addition, we
may imagine that a large sample of requests is taken as the base set of
requests. Providing the sample is large enough, it is unlikely that a
new request, in all probability similar to one in the base set, would
produce a radically different set of documents on retrieval. However,
this does raise the question of what constitutes a sufficiently large
sample of requests. The problem of extending the system to deal with
new requests will be referred to as the problem of generalizing.

The only satisfactory solution is to choose a base set of
requests in such a way that the set is representative of all future
requests which may be submitted to the system. There are a number of
ways of doing this. Suppose, for example, that it were possible to
make a statement about the distribution of terms in the requests which
may be submitted to the system. In this case probabilities could be
assigned to the output of the system to indicate that, although a
document may be deduced relevant to a request, it is only guaranteed
relevant by the system with the specified probability. Note that this is
not the same as degree of relevance since the document may be entirely
irrelevant to the request. This is a little unsatisfactory in the light
of the system's anticipated good response to requests of the base
set.

Another possibility is to require that each request satisfy a
condition which would be phrased as generally as possible, and the
pseudo-classification would be constructed only with requests satisfying
this. The system in turn would be required to respond only to requests
which were of this type. Such a condition may contain a statement about the
co-occurrence of terms within requests and we may, therefore, find
ourselves in an inconsistent position. We attempt to base a retrieval
system on considerations which do not involve classification techniques
per se, but we require some kind of classification in order to ensure
that the procedure will be capable of generalizing to subsequent requests.

A final remark in this connection is that there may be available 
a large collection of requests and the associated relevance judgments,
and that although it cannot be demonstrated categorically, they are 
highly likely to represent a typical and representative sample of all the 
requests which may be put to the system in future, in that all the dif- 
ferent kinds of request are adequately represented. Although this re- 
mark is pragmatic and formally indefensible, it may happen in a practi- 
cal application that this state of affairs nevertheless obtains. 

2.2 The Evaluative Use of Pseudo-Classifications

We shall now consider the use to which pseudo-classifications may
be put in evaluating retrieval performance. Consider the following
experiment. An information retrieval system is designed which conforms to
some model. A match function is specified which is intended to measure
the relevance of a document to a request. The design of this function
is based on intuitive ideas about relevance and is an internal analogue
of relevance as judged by the users of the system. The function may be
good or bad, according to its success in retrieving relevant and only
relevant documents, and may be subject to change and modification as our
ideas about the internal representation of relevance change. We also
possess a number of algorithms for producing term classifications
conforming to a model of classification. We want to isolate the
classification which, when used in conjunction with the match function,
results in the highest retrieval performance. These then are the pieces of
apparatus required for the experiment. There remains, however, the
problem of evaluation. Although the use of precision-recall curves for
this purpose seems to be well established in the literature, there is
still criticism of them and they should, therefore, be regarded only as a
temporary solution. There is no one method which is generally accepted.
For example, Cleverdon and Keen (7) compute recall and precision without
regard to the order in which documents are retrieved. Salton (1),
on the other hand, is interested in determining whether the relevant
documents are retrieved first. Swets (6) proposes a decision theoretic
approach.

Suppose now that a certain match function is decided upon for the
experiment. Classifications are produced by a number of algorithms and
we wish to find which of these yields the best retrieval performance for
a particular way of measuring performance. Each of these classifications,
in turn, is used in the retrieval system, and the resulting performances
are compared in a simple way to determine the best. With luck the results
may suggest a variation in classification technique and by experiments
of this kind it is hoped eventually to arrive at a classification
algorithm which will give the best performance.

There is an obvious difficulty in this approach. For a given
document collection, base set of requests and table of relevance there
is a best possible level of performance, according to some measure of
performance, the given match function and the classification model.
The measures of performance customarily used do not relate the performance
of a retrieval system to the best theoretical performance for the
retrieval model, with the result that there is no indication of the extent
to which the system may be improved or the direction in which such
improvements may occur. Suppose, however, that it were possible to
examine all the classifications in turn which satisfied the classification
model in a retrieval system with a given match function and a given
measure of retrieval performance. The maximum theoretical performance
for the model would be given by the classification which resulted in
performance better than any of the other classifications, since all
possible representations in the model would have been examined in order
to make this assertion. Such a classification is a pseudo-classification
since it is the one which agrees most closely in retrieval with the
relevance table for the base set of requests. Moreover, the
pseudo-classification does not depend upon the measure of retrieval
performance but more directly on the extent to which the relevance
judgments set out in the relevance table have been reproduced by the
system. The effect of the pseudo-classification may be assessed
subsequently by choosing a particular measure of retrieval performance
which uses this information, and the numerical values which result will
represent the best possible performance obtainable with the given models.

The problem, however, cannot in fact be approached in this way for
although the number of partitions, and, therefore, the number of
classifications, of a finite number of terms is itself finite, the amount of
computer time needed for the evaluation of them all would be prohibitively
large. Instead of enumerating them all, we need a method for isolating
those which are the best for the particular retrieval experiment. The
point here is that we are no longer interested in the extraction of
classifications in the sense of coherent groups of terms. We are only
interested in formal groups of terms which result in good retrieval
performance because the hypothesis that a particular classification
technique is the best for a specific retrieval experiment is precisely
the hypothesis which we mean to test. These partitions are the
pseudo-classifications, an algorithm for the construction of which has been
alluded to earlier in Section 1.2.
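
The scale of the enumeration can be made concrete. The number of partitions of a set of n terms is the Bell number B(n), a standard combinatorial quantity that the text does not name; the sketch below computes it with the Bell triangle to show how quickly exhaustive evaluation becomes impractical.

```python
def bell_number(n):
    """Number of partitions of a set of n elements, via the Bell triangle."""
    row = [1]  # row 0 of the triangle; its first entry is B(0)
    for _ in range(n):
        # Each new row starts with the last entry of the previous row, and
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


for n in (5, 10, 20, 50):
    print(n, bell_number(n))
```

Even a vocabulary of a few dozen terms gives a count far beyond what could be evaluated one classification at a time, which is why a constructive algorithm is needed instead.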

The following way of measuring retrieval performance is suggested
by the foregoing remarks. A particular match function and performance
measure are chosen. A pseudo-classification is constructed (by
algorithm rather than by selection from a complete enumeration of the
representations of the classification model) which, when used in
retrieval, gives rise to a particular level of performance. This is the
theoretical best for the data, for the classification and retrieval
models, and for the match function. The proposed measure of performance
measures the departure of the performance in a particular case from this
theoretical best. If this departure is small, little improvement may be
gained by changing the classification technique and it would indicate
that further experimenting in this direction would not be profitable.
The introduction of another match function, however, would result in a
different pseudo-classification and a different estimate of the
theoretically best performance. If this were better than the optimum for the
previous match function, then the new match function may be
regarded as more useful in retrieval than the previous one in that it
permits the system, at least theoretically, to achieve an improvement
in performance. If the measure of retrieval performance were changed it
would not be necessary to regenerate the pseudo-classification since
this does not depend on the formulation of the measure. It is in this
area that pseudo-classifications are particularly applicable; that is,
in an evaluative rather than a predictive role. The following
definition is, therefore, made:

2.2.1 The difference in the performance in retrieval of the pseudo- 
classification and the automatically generated classification (accord- 
ing to some measure of performance) gives an estimate of the improvability 
in performance of the retrieval model for the match function and classifi- 
cation algorithm which have been applied. 
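
Definition 2.2.1 can be illustrated with a small sketch. Mean recall over the base set of requests is used as the performance measure purely as an assumption for the example, since the definition leaves the measure open; the function names and data are hypothetical.

```python
def mean_recall(results):
    """Average recall over requests.

    results is a list of (retrieved, relevant) pairs of document-id sets,
    one pair per request of the base set.
    """
    return sum(len(ret & rel) / len(rel) for ret, rel in results) / len(results)


def improvability(measure, pseudo_results, automatic_results):
    """Performance with the pseudo-classification minus performance with
    the automatically generated classification, under the same measure."""
    return measure(pseudo_results) - measure(automatic_results)


# Hypothetical retrieval output for two requests under each classification.
pseudo = [({1, 2}, {1, 2}), ({3}, {3})]
automatic = [({1}, {1, 2}), ({3}, {3})]

gap = improvability(mean_recall, pseudo, automatic)
```

A gap near zero would suggest that little is to be gained by changing the classification algorithm for the given match function, while a large gap would indicate room for improvement within the model.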

Although the use of this performance measure (2.2.1) may indicate
whether it is more profitable at a particular stage of research to
improve performance by changing either the classification algorithm or the
match function, no indication is given of how this is to be done. We are
in the odd position of knowing which classification of the model gives
the best performance yet being unable to supply a means of deriving it
without recourse to the relevance judgments of the base set of requests.
In spite of this, its usefulness as an evaluative device is not impaired.
It does not replace the researcher; it assists him to be more precise
in his analysis of a particular experimental model.

2.3 The Isolation of Inconsistencies in
Relevance Judgments

The practicability of constructing a pseudo-classification must 
now be raised. It may happen that after the classification has been 
constructed to give the correct response to a number of requests of the 
base set, a request is encountered whose processing conflicts diametrical- 
ly with some of the conditions set up to prevent deterioration of the 
classification with respect to previously processed requests. It is 
possible that this results from an inconsistency in the judgment of re- 
levance of documents to requests as supplied by experts in the field 
covered by the document collection. However, we expect that the assess- 
ment of external relevance by an individual does not lead to serious 
inconsistency and that the same is true for the determination of relevance 
by consensus of opinion. Detailed work by Resnick and Savage (8) on the
consistency of human judgments of relevance supports this view. It may
also transpire that the number of conditions which have to be constructed 
to prevent deterioration, or the number of decisions which have to be 
taken and are represented by the assignments of terms to classes,
increases at such a rate with the number of requests processed that the
classification becomes overdetermined at an early stage of construction.
It is hoped to show that this does not, in fact, happen. The best which
can be done with a request which leads to an inconsistency is to reject
it and tolerate a small decrease in performance over the base set of
requests. It would be of interest to determine the proportion of
requests of this type to successfully processed requests and to examine
the conditions which gave rise to the inconsistencies. The origins of 
the inconsistencies may, of course, be difficult to locate, but at least 
it will be known that a particular document-request pair cannot be
manipulated in the pseudo-classification to satisfy the conditions set 
up for earlier pairs without impairing retrieval performance. This in 
itself is of interest as a guide to subsequent work on the construction 
of classification and match algorithms within the retrieval model. 

On a purely practical level, there is little defence against in- 
consistent judgments of external relevance. If relevance judgments were 
to be completely idiosyncratic, a retrieval system would have to be 
constructed for each user, and the only way forward would be by
interactive techniques. If relevance judgments were to be arbitrary, no
system could function at a predictable level of performance. A fixed, 
that is, non-interactive system relies on a consensus of opinion of 
users as to the relevance of documents to requests. 













SECTION III

3. A RETRIEVAL MODEL

3.1 The Information Model

An information model (vide 1.2) is a statement of the categories
of information used to describe the system followed by a statement of
the assumptions made about these categories. Within this model,
algorithms may be designed to perform specified operations and they must
not refer directly or indirectly to information which is outside the
model.

Section 1.1 referred to the four sources of information required
by an evaluative retrieval system. These are:

3.1.1 D the set of documents defined extensionally by terms.
This is the document collection.

3.1.2 R the set of requests defined extensionally by terms. This
is the base set of requests.

3.1.3 Z the set of requests defined extensionally by the documents
relevant to them. This is the set of relevance judgments.

3.1.4 C the set of classes defined extensionally by terms. This
is the term classification.

These are similar in that each defines one set of elements extensionally
in terms of another set of elements. Each, therefore, may be regarded
as a rectangular incidence array giving the occurrence of an element of
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one type with an element of another type. The sign *_* indicates that 
we are considering a set of elements defined in terms of another set 
(i.e., an array) rather than a single element defined in terms of a set 
of elements. Attention is confined to docianents indexed by using simple 
term coordination and document collections indexed probabilistically, for 
example, by the methods of Maron and Kuhns (9)^ are not considered. The 
reasons for this are threefold. First, the document collection used is 
not indexed probabilistically, although some attempt might be made to 
rectify this. Secondly, it is felt that tests with undifferentiated 
terms logically should precede any experimenting with term weighting in 
order first to establish a basis for comparison. Thirdly, the construc- 
tion of pseudo-classifications is simplest in this case since a single 
operation on the classification involves the dichotomous choice between 
the removal or insertion of a term. The probabilistic case is more com- 
plex since the choice of a value to be assigned to the weight is no 
longer dichotomous. 
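
The incidence-array view of the four sources can be illustrated with a minimal sketch. It assumes that each source is held as a mapping from an element to the set of elements which define it; the toy vocabulary, the document and request names and the helper incidence_array are hypothetical and are not part of the model itself.

```python
# Hypothetical toy data: every name below is illustrative only.
TERMS = ["boiler", "furnace", "heat", "steam"]

# D: documents defined extensionally by terms (the document collection).
D = {"d1": {"boiler", "steam"}, "d2": {"furnace", "heat"}}
# R: requests defined extensionally by terms (the base set of requests).
R = {"r1": {"furnace", "steam"}}
# Z: requests defined extensionally by the documents relevant to them.
Z = {"r1": {"d1"}}
# C: classes defined extensionally by terms (the term classification).
C = {"c1": {"boiler", "furnace"}, "c2": {"heat", "steam"}}


def incidence_array(sets, columns):
    """Return one 0/1 row per defined element, one column per defining element."""
    return {
        name: [1 if col in members else 0 for col in columns]
        for name, members in sets.items()
    }


doc_array = incidence_array(D, TERMS)        # documents against terms
rel_array = incidence_array(Z, sorted(D))    # requests against documents
```

Each array is binary, which corresponds to the unweighted, non-probabilistic case to which attention is confined.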

The terms which specify the requests also appear unweighted as
do, for quite different reasons, the documents relevant to the base set
of requests. It may be possible to construct a scale of relevance and
assign degrees of relevance of documents to requests according to this
scale. Instead, those documents have been selected which have been
given the highest degree of relevance. The scale of relevance,
therefore, as applied externally, has two values, namely, relevant and
not relevant. It is realized that the external judgment of relevance is
more complex than this and that there is no simple division of documents
into those relevant to a request and those not relevant to a request.
The insertion of a third category would be more realistic, and such a
category, namely "of unspecified relevance", would serve to separate the
polarities of relevance more clearly. The quantification of the degree
of relevance of documents to requests has further difficulties of quite
a different nature. Suppose, for example, that it were possible to
place in order of increasing relevance to a particular request all the
documents of the collection. As soon as numerical values are assigned
to each document to quantify its degree of relevance, metric properties
are assumed about the scale of relevance. During the course of
retrieval, arithmetic operations on degrees of relevance are performed
which tacitly assume the truth of statements like, for example:
"document D1, whose degree of relevance to a request R is I, is I/J times
as relevant to R as document D2, whose degree of relevance to R is J."
Until a metric has been established for degrees of relevance such
statements remain indefensible. Resnick and Savage (8) have proposed to
make a study of this question. For these reasons it is decided to work
with categories of relevance rather than degrees of relevance; that is,
with a qualitative scale rather than a quantitative scale. The
two-valued scale has been adopted since the main body of data for testing
purposes is reducible to this form.

3.2 The Classification Model 

The model by which classifications to be constructed are
constrained is as follows. Membership of terms to classes is a binary
property; the object either belongs or does not belong to a class. The
probabilistic assignment of terms to classes is excluded. All
assignments of objects to classes are a priori independent, and
overlapping classes are allowed and indeed expected. Finally, the
classification is non-hierarchical. It should be noted that the classes
of the classification are defined extensionally. Terms are not assigned
to classes according to their satisfying a known condition on the class.
It may, nevertheless, transpire that classes have useful properties in
terms of the character of the vocabulary.

The model of retrieval, therefore, is such that all sets
encountered are defined extensionally and non-probabilistically. For
explanatory purposes we shall refer to 3.1.2, 3.1.3, and 3.1.4 as
the retrieval environment. This is slightly different from the way
this term has been used elsewhere (Jackson (10)), but throughout the
remainder of this paper it will be used consistently with this meaning.

3.3 The Relevance Model

The match functions which will be applied will only make use of
the information contained in or derivable from the environment. We
shall, therefore, write:

3.3.1    l = M(D, R, f(D, C), f(R, C))

where l is the match coefficient corresponding to the match function M
applied to an arbitrary document D in the document collection and an
arbitrary request R in the base set of requests, using the
classification C. f is a function which produces from a description of
a set specified using terms a description using classes.
f(D, C) is equivalent to S(D, 0, C) defined by Jackson (10).
Formulations of f will be given in Sections 5.2 and 6.1. The purpose of
f is to provide a means of recovering in class matches the term matches
which were missed on simple matching of the term descriptions for R
and D because a term was used in one of these and a variant of this
term in the other. The relationship between these terms may or may
not be one of actual synonymy in natural language. The intention is
that the classification should contain classes of terms which are mutually
intersubstitutable and result in good retrieval performance.

In accordance with the relevance hypothesis (1.1.2) we require 
that M should be a monotonically increasing function of the number of 
terms or classes in common between R and D. Its behaviour with respect 
to the terms belonging to one of them but not to the other is not 
specified. This is a subject for precise formulation in a specific 
realization of M which satisfies the conditions mentioned. The modifi- 
cations to the pseudo-classification as it is being constructed will 
be seen (vide 4.3) to involve the addition of terms to classes already 
defined or the grouping of terms to form new classes. The size of 
the classes and their number will, therefore, vary during construction 
and no particularly relevant interpretation may be put on them. In 
addition, if the match function is allowed to depend on them explicitly, 
the classification will be unalterable or will certainly deteriorate 
with respect to already processed requests as new requests are examined. 
The dependence of the match function on these two quantities is, there- 
fore, explicitly proscribed. 

The values resulting from applying the match function to a
document which is absolutely relevant to a request and to a document
which is absolutely irrelevant to a request may be specified at will.
Although absolute irrelevance is not as precise an intuitive concept
as absolute relevance (for one can always find some reason for saying
that a document is slightly relevant to a request, whatever it is, in
a collection of restricted subject matter), we shall rest content
simply to regard it as being that relationship which exists between
document and request and which obtains at the lowest value of the match
coefficient. A corollary of the relevance hypothesis is that this
coefficient must increase monotonically to its maximum value which
represents absolute relevance. We shall, therefore, require that the
bounds of the match function be finite and that these bounds are attained,
at least in theory, by the function. The additional conditions on the
match function are, therefore:







3.3.2   M is a monotonically increasing function of the number of
        terms or classes in common between R and D.

3.3.3   M is independent of the size of any class in C and of the
        number of classes in C.

3.3.4   l is bounded above and below, and attains its bounds.

It might happen that once the classification has been constructed
using the base set of requests it is still underdetermined. Additional
requests and their appropriate relevance judgments could be used
to complete the classification only if the match function does not
depend on the number of requests in the base set. Similarly, additional
documents could only be added to the collection provided the match
function does not depend on the size of the collection. Since both of
these are valuable properties of a retrieval system we shall add the
appropriate conditions on M:

3.3.5   M does not depend on the size of R.

3.3.6   M does not depend on the size of D, or on the size of the
        term vocabulary.



Two examples of match functions are given by Jackson (10).



So far, there is no criterion of relevance in the model. Any
criterion is bound to be arbitrary to a certain extent, for it is never
possible to have complete knowledge either of the document collection
or of the mind of the user of the system. To use the complete document
unprocessed is as far from the solution as hoping to provide a
complete analysis of the document, revealing in complete detail the
complexity of the structural and semantic relationships between all the
linguistic elements in the document. We have to make do with
approximations, which we hope will reveal the salient features of the
collection for the purposes of automatic retrieval. In the model, the
upper and lower bounds of the match coefficients which represent the
polarities of relevance are known. We have attempted by 3.3.1 to
establish inside the model a scale of relevance between these poles and
somewhere along this scale we must define a value, above which we
retrieve documents and below which we suppress them. This value is
called the critical value of the match coefficient. Categories of
relevance are assigned, therefore, to retrieved documents, rather than
degrees of relevance. In the absence of any evidence to the contrary,
it will be assumed that the values of the match coefficient are
distributed over the document collection in such a way that the critical
value is the value midway between the extremes of the match coefficient.
Subsequent experiment may cause us to revise this assumption. Thus,
for deciding on a suitable value for the critical point in the scale
of relevance, the following hypothesis is made:

3.3.7   The critical value is the arithmetic average of the upper
        and lower extrema of the match coefficient.

This will be referred to as the critical value hypothesis.

The following notation is introduced in connection with the above
assumptions. Suppose that l^+ and l^- are the upper and lower bounds of
the match coefficient l, respectively, and suppose that l_c is the
critical value of l. In accordance with 3.3.7, l_c is defined as:

3.3.8    l_c = (1/2)(l^+ + l^-) .
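
The critical value hypothesis and the retrieval rule it supports can be sketched as follows. This is a minimal illustration only: the function names and the sample coefficients are hypothetical, and the text does not say how a coefficient exactly equal to the critical value is to be treated, so suppressing it here is an assumption.

```python
def critical_value(lower, upper):
    """Arithmetic average of the two extrema of the match coefficient."""
    return (lower + upper) / 2.0


def retrieve(coefficients, lower, upper):
    """Names of the documents whose coefficient lies above the critical value."""
    l_c = critical_value(lower, upper)
    # Assumption: a coefficient equal to l_c is suppressed, not retrieved.
    return sorted(doc for doc, l in coefficients.items() if l > l_c)


# Hypothetical match coefficients on a scale running from 0 to 1.
scores = {"d1": 0.8, "d2": 0.5, "d3": 0.1}
retrieved = retrieve(scores, 0.0, 1.0)
```

With these sample values only d1 is retrieved, since d2 lies exactly on the critical value and d3 lies below it.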

We shall use the binary asymmetrical relations @ and ~@ to denote
'relevant to' and 'not relevant to', respectively. Thus:

3.3.9    l > l_c  <=>  D @ R

   and   l < l_c  <=>  D ~@ R .

For completeness, we also introduce the binary asymmetrical relations
@* and ~@* to denote 'absolutely relevant to' and 'absolutely irrelevant
to', respectively. Thus:

3.3.10   l = l^+  <=>  D @* R

   and   l = l^-  <=>  D ~@* R .

SECTION IV

4. METHODS FOR CONSTRUCTING PSEUDO-CLASSIFICATIONS
4.1 Two Approaches

Within the model of retrieval set out in Section 3 two approaches
in the construction of pseudo-classifications may be distinguished. The
first approach may be regarded as an attempt at an analytic solution
by determining the 'inverse' of the match function. This is seen more
clearly from 3.3.1 which, in the context of Section 3.3, defines the
match coefficient in terms of other elements of the model. This
relationship is open to more general interpretation in which it is
regarded as an equation connecting l, D, R, and C by the functions f and
M, which are given. Providing only requests of the base set are
considered the values of l, or at least their magnitudes with respect to
the critical value of the match coefficient, are known. In the
construction of pseudo-classifications, C is unknown so that 3.3.1 may
now be interpreted as an implicit definition of C. Such a solution,
however, depends on whether M possesses the requisite algebraic
properties which enable the inversion to be performed. Even if these
properties were known, the method would probably reduce to an amount
of matrix manipulation (for D, R may be regarded as rectangular binary
matrices) which would make such a solution uneconomic for the scale of
document collection we hope to process eventually.




An alternative solution is by the method of perturbations, which
has the advantage that less analysis of the match function is required.
The classification is subjected to a number of alterations involving
the insertion and deletion of terms from classes until the appropriate
response to the base set of requests is elicited from the system. The
alterations are carried out by perturbation functions. The difficulty
with applying perturbations to the classification is to ensure that
the method converges to a solution. Not only must convergence be proved
but it must also be shown that the limit of this process is attained
and is the required classification. For the moment we are content to
outline a method which at least converges.

4.2 General Outline of a Method

Suppose that it were possible, by a suitable perturbation, to
cause the system to give the correct response, relevant or not relevant,
to a specific document-request pair. With a suitable convergence
theorem it would be possible to treat each document-request pair
independently of the others. The response for one document-request pair
might be destroyed or impaired by later adjustments to accommodate
other pairs but at worst it would be necessary to examine each pair a
number of times. The convergence theorem would ensure that although a
number of responses may be impaired, gradual convergence would
nevertheless set in. Without such a theorem, it is necessary to use a
method in which each document-request pair is examined once only and in
which conditions are set up to prevent a response from being obliterated
by the processing of subsequent pairs. The degrading of responses by
adjustment of the classification to accommodate subsequent
document-request pairs will be referred to as deterioration. The
conditions necessary to prevent the destruction of a response are
referred to as deterioration conditions.

It is now possible to give an outline of a method for
constructing pseudo-classifications:

4.2.1   i.   Let C' be the state of the pseudo-classification.

        ii.  Perturbations are applied to C' until the correct
             response is given by the system to a particular (D, R).
             The state of the pseudo-classification is then C''.

        iii. Deterioration conditions are set up for (D, R) with
             respect to C''.

        iv.  The process is repeated for the next (D, R).
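
The outline can be sketched as a single pass over the document-request pairs. The sketch below is only an illustration under stated assumptions: the four hooks passed to construct, and the toy state made of (term, class) assignments, are hypothetical, and a pair for which no perturbation remains is rejected, following the earlier remark that the best which can be done with an inconsistent request is to reject it.

```python
def construct(pairs, state, is_correct, candidate_perturbations, set_conditions):
    """One pass over the pairs; returns the final state and the rejected pairs.

    All four callables are hypothetical hooks supplied by the caller.
    """
    rejected = []
    for pair in pairs:                              # step iv: next (D, R)
        while not is_correct(state, pair):          # step ii: perturb until correct
            options = candidate_perturbations(state, pair)
            if not options:                         # inconsistency: reject the pair
                rejected.append(pair)
                break
            state = options[0](state)
        else:
            set_conditions(state, pair)             # step iii: protect the response
    return state, rejected


# Toy usage: the state is a set of (term, class) assignments and a pair is
# "correct" once its assignment is present; `forbidden` stands in for
# deterioration conditions set up earlier.
forbidden = {("heat", "c1")}
wanted = [("steam", "c1"), ("heat", "c1")]


def is_correct(state, pair):
    return pair in state


def candidates(state, pair):
    if pair in forbidden:
        return []
    return [lambda s: s | {pair}]


def set_conditions(state, pair):
    pass  # nothing to record in this toy example


final_state, rejected = construct(
    wanted, frozenset(), is_correct, candidates, set_conditions
)
```

Each pair is examined once only, as the method requires in the absence of a convergence theorem.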



4.3 Design of Perturbation Functions 



Deterioration conditions for the method described in 4.2.1 are
effective only if all the perturbations which might lead to
deterioration have been examined. It is, therefore, necessary that the
perturbations which affect the match function should be exhaustively
enumerated and that there should be few enough of them to make the
method practicable. For each document-request pair, all perturbations
which might lead to deterioration must be examined. We shall require,
as a purely practical constraint on the method, that the number of
deteriorating perturbations be less than the number of document-request
pairs to be considered. The algorithm for constructing
pseudo-classifications is, therefore, less than order two in the number
of document-request pairs.

The operations which are appropriate to altering the
classification are:

4.3.1   Assignment of terms to classes.

        a(x_i -> y_i); < condition on {x}^p, {y}^p >   for all i ≤ p

4.3.2   Removal of terms from classes and their assignment to other
        classes.

        r(x_i : y_i -> z_i); < condition on {x}^p, {y}^p, {z}^p >
        for all i ≤ p

where {x}^p is an ordered set of p terms and {y}^p, {z}^p are ordered
sets of p classes, and where the conditions limit the choice of operands
for a and r. x_i, y_i, z_i are the i-th members of the sets {x}^p,
{y}^p, {z}^p, respectively. p is called the step length of the
perturbation. A complete enumeration of the perturbations with step
length P which affect the classification involves the enumeration of all
perturbations of smaller step lengths (i.e., P-1, P-2, ..., 2, 1).

Another possible perturbation is the simple removal of p terms
from p classes. This, however, might result in the complete evacuation
of terms from the classification. The removal of terms from classes,
however, is provided for in 4.3.2 and since in this case terms are
reassigned to classes, there is no possibility of complete evacuation of
the classification. For these reasons the simple removal of terms
from classes without reassignment is not considered.

The effect of 4.3.1 on the classification is to classify terms
(i.e., to insert terms into classes of the pseudo-classification) while
the effect of 4.3.2 is to reclassify terms (i.e., to redistribute
terms among the classes of the pseudo-classification). A further
distinction may be drawn. The classes mentioned in 4.3.1 and 4.3.2 are
of two kinds. A class may already exist in the classification so that
the effect of 4.3.1 and 4.3.2 is to add or remove terms. The effect,
therefore, is to alter the constitution of a class. The number of classes
in the classification remains unchanged or decreases. The classes may,
however, be new in the sense that it is only when the functions have
been applied that the classes enter the classification. They are not
modifications of already existing classes for the number of classes
in the classification increases. The effect of 4.3.1 and 4.3.2 when
applied to such classes is to create new classes within the
pseudo-classification. These distinctions are made use of in Section 4.5
where a method is proposed for selecting the appropriate perturbation
to apply to a given document-request pair.

The choice of perturbation is further determined by the
response which must be simulated for a given document-request pair.
Suppose, for example, that the response of the retrieval system to the
pair (D, R) is 'not relevant'. Suppose further that 'relevant' is the
correct response. A perturbation must be applied to the classification
which has the effect of increasing the value of the match coefficient
for (D, R) to a level greater than the critical value. To facilitate
this selection it is, therefore, important that the perturbations
should be grouped into those which increase the match coefficient, those
which leave the match coefficient unchanged and those which decrease
the match coefficient. The following terms are, therefore, used:

Increasing perturbations have the effect of increasing the
match coefficient for a given document-request pair.

Level perturbations have the effect of leaving unchanged the
match coefficient for a given document-request pair.

Decreasing perturbations have the effect of decreasing the
match coefficient for a given document-request pair.
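
The grouping can be made concrete with a small sketch. Both helper names are hypothetical; the first names the group of a perturbation from the match coefficient before and after it is applied, and the second names the group that is needed to correct a response. Treating a coefficient exactly equal to the critical value as an incorrect response in either case is an assumption made for this illustration.

```python
def effect_group(before, after):
    """Group of a perturbation, judged by the coefficient before and after it."""
    if after > before:
        return "increasing"
    if after < before:
        return "decreasing"
    return "level"


def required_group(coefficient, critical, relevant):
    """Group needed to correct the response, or None if it is already correct."""
    if relevant and not coefficient > critical:
        return "increasing"
    if not relevant and not coefficient < critical:
        return "decreasing"
    return None


# Example: the system says 'not relevant' (0.3 < 0.5) but 'relevant' is correct.
needed = required_group(0.3, 0.5, relevant=True)
```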



Once perturbation functions have been grouped according to
their effects on the match function, it is clear how the deterioration
conditions should be determined. If it is necessary to apply an
increasing perturbation to accommodate a particular document-request
pair, then conditions must be set up to inhibit the application of the
decreasing perturbations which may lower the match coefficient below the
critical value. Similarly, if a decreasing perturbation is applied then
conditions must be set up to inhibit the action of increasing
perturbation functions.

The number of perturbations of the form 4.3.1 and 4.3.2 increases
with the step length p. In order to satisfy the practical constraints
on the model only single step (p = 1) perturbations will be examined
in detail.






4.4 Deterioration Conditions 

Suppose that t_i is a term and that C_j is a class in the
classification. It follows from the model of classification defined in
Section 3.2 that either t_i ∈ C_j or t_i ∉ C_j and that membership is
non-probabilistic. The classification C may, therefore, be represented by
a binary array indicating the incidence of terms in classes. Thus:

4.4.1    C_ij = 1   <=>  t_i ∈ C_j
         C_ij = -1  <=>  t_i ∉ C_j

where C is used to denote both the classification and the array. During
the construction of the classification, however, it is useful to use
another possibility of membership of terms to classes. The value
C_ij = 0 is to imply that on the basis of the information used so far,
no decision may be taken about the membership of t_i to C_j although at
a later stage of construction a definite choice may be made. During
the construction of the classification the array is tri-valued. The
values 1, 0, -1 are referred to as status values.

During the construction of the pseudo-classification,
perturbations are applied which affect the membership of terms to
classes and accordingly change the corresponding status values. The
change in the status value is called a transition and is defined as:

4.4.2    e_i -> e_j :  status value e_i is changed to status value e_j
                       in the pseudo-classification.

The transition table T_ij gives the set of permitted transitions and is
defined by:

4.4.3    T_ij = 1  <=>  e_i -> e_j is allowed
         T_ij = 0  <=>  e_i -> e_j is not allowed

where e_i, e_j ∈ (-1, 0, 1). From 4.3.1 and 4.3.2 the single step
perturbations are of the form:

4.4.4    a(x -> y); < conditions on (x, y) >

4.4.5    r(x : y -> z); < condition on (x, y, z) > .

The deterioration condition associated with 4.4.4 is 'x must never be
assigned to y'. This is achieved by forbidding the membership of term
x to class y and by forbidding any transition from the status value -1.
The effect on the pseudo-classification is, therefore:

         C_ij = -1   where x = t_i and y = C_j

and the required values in the transition table to ensure that this is
never revoked are:

         T_{-1,1} = 0 and T_{-1,0} = 0 .

The change to the classification is effected if

         T_{0,-1} = 1 .

The deterioration condition associated with 4.4.5 is 'if x is in y then
x may not be assigned to z'. This condition is recorded in the
condition table Q(k:i, j) defined by:

4.4.6    Q(k:i, j) = 0  <=>  r(t_k : C_i -> C_j) allowed.
         Q(k:i, j) = 1  <=>  r(t_k : C_i -> C_j) not allowed.

It is important to note that a condition of this sort may not be
revoked in a later stage of the construction of the pseudo-classification.
That is, within the condition table the change from 1 to 0 is not
allowed.

The effect of 4.4.4 on the pseudo-classification is:

         C_ij = 1   where x = t_i and y = C_j

and for this to be possible the transition 0 -> 1 must be allowed:

         T_{0,1} = 1 .

The effect of 4.4.5 on the pseudo-classification is:

         C_ki = 0, C_kj = 1   where x = t_k, y = C_i, z = C_j

and for this to be possible the transition 1 -> 0 must be allowed:

         T_{1,0} = 1 .

The remaining transitions e_i -> e_i for e_i ∈ (-1, 0, 1) and 1 -> -1 are
allowed since they are not explicitly excluded by the above analysis.
These results are collated below. The transition table is, therefore,

4.4.7              to:   -1     0     1
         from -1          1     0     0
         from  0          1     1     1
         from  1          1     1     1

The effect of 4.4.4 on the classification array is:

4.4.8    C_ij = 1 .

The effect of 4.4.5 on the classification array is:

4.4.9    C_ki = 0, C_kj = 1 .

The deterioration condition for 4.4.4 is:

4.4.10   C_ij = -1 .

The deterioration condition for 4.4.5 is:

4.4.11   Q(k:i, j) = 1 .

Two functions a' and r' are now introduced which have the effect
of setting up the deterioration conditions 4.4.10 and 4.4.11 associated
with a and r, respectively. The functions are called conditional
perturbations and are defined as:

4.4.12   a'(x -> y); < condition on (x, y) >

         whose effect is C_ij = -1 where x = t_i and y = C_j
         (vide 4.4.10)

4.4.13   r'(x : y -> z); < condition on (x, y, z) >

         whose effect is Q(k:i, j) = 1 where x = t_k, y = C_i, z = C_j
         (vide 4.4.11).
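
The status values, the transition table and the condition table can be drawn together in a minimal sketch. It rests on several assumptions: the class name PseudoClassification and its method names are hypothetical, undecided status values are simply left out of the dictionary and read as 0, and r is taken to require that x currently belongs to y. Only the permitted transitions and the two deterioration conditions come from the text.

```python
# Permitted transitions, keyed by (old status, new status).
ALLOWED = {(old, new): True for old in (-1, 0, 1) for new in (-1, 0, 1)}
ALLOWED[(-1, 0)] = False    # a status of -1 may never be revoked
ALLOWED[(-1, 1)] = False


class PseudoClassification:
    def __init__(self):
        self.status = {}      # (term, class) -> 1, 0 or -1; missing means 0
        self.blocked = set()  # condition table: (term, from, to) entries equal to 1

    def _set(self, term, cls, new):
        old = self.status.get((term, cls), 0)
        if not ALLOWED[(old, new)]:
            return False
        self.status[(term, cls)] = new
        return True

    def a(self, x, y):
        """Assign term x to class y."""
        return self._set(x, y, 1)

    def a_prime(self, x, y):
        """Conditional perturbation: x must never be assigned to y."""
        return self._set(x, y, -1)

    def r(self, x, y, z):
        """Remove term x from class y and assign it to class z."""
        if (x, y, z) in self.blocked or self.status.get((x, y), 0) != 1:
            return False
        if not ALLOWED[(self.status.get((x, z), 0), 1)]:
            return False
        self.status[(x, y)] = 0
        self.status[(x, z)] = 1
        return True

    def r_prime(self, x, y, z):
        """Conditional perturbation: x may not be moved from y to z."""
        self.blocked.add((x, y, z))   # entries are never removed


pc = PseudoClassification()
first = pc.a("steam", "c1")            # 0 -> 1 is allowed
pc.a_prime("heat", "c1")               # 0 -> -1 is allowed
second = pc.a("heat", "c1")            # -1 -> 1 is refused
pc.r_prime("steam", "c1", "c2")
third = pc.r("steam", "c1", "c2")      # refused by the condition table
fourth = pc.r("steam", "c1", "c3")     # allowed
```

Because the blocked set offers no way of removing an entry, a condition once recorded cannot be revoked, which mirrors the rule that the change from 1 to 0 is not allowed in the condition table.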

4.5 Precedence of Perturbations 

It has been seen in 4.3 that perturbations may be grouped in two
different ways: according to their effect on the match coefficient and
according to their general effect on the classification. The selection
of an appropriate perturbation to apply for a given document-request
pair is determined to a certain extent by a knowledge of the system's
response and the response specified in the relevance table. If, for
example, the match coefficient for a document-request pair is lower than
the critical value and it is known that the document is relevant to the
request then an increasing perturbation function is required. The
choice of perturbation function is further determined by establishing
a precedence between perturbations according to their effect on the
pseudo-classification, namely classifying, reclassifying or creating
(vide 4.3). These three groups of perturbations will be denoted by c,
r, and n, respectively. This is not, however, an exclusive grouping
since a particular perturbation may be, for example, both r-type
and n-type, that is, its effect is both to reclassify and to introduce
a new class. A complete exclusive grouping consists of the groups c
(classify), r (reclassify), cn (classify and create new class) and
rn (reclassify and create new class). The type of a perturbation is,
therefore, defined as:

4.5.1    type = c or r or cn or rn .

The first precedence to consider is that between r and c.
During the construction of the pseudo-classification it is advantageous
first to attempt to reclassify the terms already classified until no
further reclassification is possible. At this point more terms are
admitted to the classification and are subsequently reclassified as
appropriate. The precedence rule which achieves this effect is:

         r > c .

The same argument holds for rn and cn and we obtain the rule:

         rn > cn .

A final requirement is that the classification should be constructed
with as few classes as possible to achieve the required distinctions
among the terms. This position is approached, at least in principle,
if the perturbations involving the creation of new classes have the
lowest precedence. Accordingly, the precedence of the types of
perturbation is given by:

4.5.2    precedence = r > c > rn > cn
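
The precedence rule lends itself to a short sketch in which candidate perturbations are ordered by type before one is tried. The representation of a perturbation as a (type, label) pair and the name by_precedence are assumptions made for the illustration.

```python
# Types of perturbation, highest precedence first.
PRECEDENCE = ["r", "c", "rn", "cn"]


def by_precedence(perturbations):
    """Order (type, label) pairs so that higher-precedence types come first.

    Python's sort is stable, so candidates of the same type keep
    their original relative order.
    """
    return sorted(perturbations, key=lambda p: PRECEDENCE.index(p[0]))


candidates = [("cn", "p1"), ("c", "p2"), ("r", "p3"), ("rn", "p4"), ("r", "p5")]
ordered = by_precedence(candidates)
```

Reclassifying perturbations are therefore tried before classifying ones, and those which create new classes are tried last.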



The selection of perturbations according to precedence and
according to their effects on the match function does not lead to a
unique function. There may be a number of perturbations which fulfil the
conditions and for each of these there may be a number of possible
choices of operands satisfying < condition on (x, y) > or < condition
on (x, y, z) > (vide 4.3.1 and 4.3.2). It will be seen that the
perturbations change the match coefficient by an amount which is
independent of the choice of arguments from among those which satisfy the
appropriate conditions. Therefore:

4.5.3    the arguments for the perturbation are chosen at random
         from the class of suitable operands.

To prevent the classification from becoming overdetermined at an early
stage of construction (in the sense described in Section 2.3):

4.5.4    the smallest number of perturbations are selected which
         together produce the required change in the match
         coefficient.

SECTION V

5. ENUMERATION OF PERTURBATIONS FOR A
GENERAL MATCH FUNCTION




5.1 Scope of the Chapter 

The enumeration to be given is of general applicability to the
class of match functions to be defined. The perturbations to be
deduced are predominantly concerned with the specific method of
constructing pseudo-classifications which was outlined in 4.2.1, since
the enumeration to be given will be complete as is required by that
method. However, the perturbations deduced are relevant to any procedure
for constructing pseudo-classifications which may be regarded as a
'method of perturbations' as defined in Section 4.1. A match function
will be taken as an example and a complete list of the perturbations
which affect it will be given together with their classification into
r-type, n-type, c-type, increasing, level, and decreasing.




5.2 Definition of the Function f 

The purpose of f, described in Section 3.3, is to produce a
description of a request or document in terms of classes from the
original term description. In contrast with term matching, in which a
match is located when a term is found in common between document and
request, a class match is located when a term from a request and a
term from a document are found to belong to the same class. The
function f facilitates the counting of class matches by producing the
appropriate set of classes in which matches may be sought. Such a set¹
of classes is called a class-description of the document or request.
It may be useful to permit the production of the class-description from
a subset of terms of the term-description. For example, since terms
in common between document and request necessarily lead to classes in
common between the corresponding class-descriptions, it may be
appropriate, as far as the calculation of class matches is concerned, to
experiment with the residue of the term-descriptions of the document and
the request after matching terms have been removed. Accordingly, f is
defined as follows.

Suppose that A is a set of terms and that A' ⊆ A. Suppose also
that C is a classification of the terms and that U is a class within
the classification. Then

5.2.1    t ∈ A' ∧ t ∈ U  =>  U ∈ f(A, C)

f(A, C) is called the class-description of A with respect to the
classification C.
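
The function f can be sketched directly from this definition. The sketch assumes that a classification is held as a mapping from class names to sets of terms; the name class_description, the optional subset argument and the toy data are hypothetical.

```python
def class_description(terms, classification, subset=None):
    """Classes containing at least one term of the chosen subset of `terms`.

    When `subset` is omitted the whole term-description is used.
    """
    chosen = set(terms) if subset is None else set(subset)
    if not chosen <= set(terms):
        raise ValueError("subset must be contained in the term-description")
    return {name for name, members in classification.items() if chosen & members}


# Hypothetical classification, request and document.
C = {"c1": {"boiler", "furnace"}, "c2": {"heat", "steam"}, "c3": {"valve"}}
request = {"furnace", "steam"}
document = {"boiler", "steam"}

full = class_description(request, C)

# Residue after the matching term "steam" has been removed from both sides.
common = request & document
request_residue = class_description(request, C, subset=request - common)
document_residue = class_description(document, C, subset=document - common)
class_matches = request_residue & document_residue
```

In this toy case the residues "furnace" and "boiler" do not match as terms but both fall in class c1, so a class match is recovered where term matching alone found none.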



5.3 Specification of the Class of Match
Functions

Suppose that R and D are term-descriptions of an arbitrary request
and document and that R' and D' are their respective class-descriptions.
Then,



¹ For explanatory purposes we prefer to use the word class to refer
exclusively to groups of terms in the classification. Thus, the
classification is an organization of terms into classes. The word set is
to refer only to incidental constructions of terms, grouped together but
not necessarily forming classes in the classification. Thus, for example,
we shall call a request R a set of terms, for there is no particular
reason why these terms should constitute a class in the classification.
R', the class-description of R, is a set of classes.












5.3.1    R' = f(R, C) and D' = f(D, C)

for some f satisfying 5.2.1. Two-by-two contingency tables with the
indicated marginal totals are now defined for R and D and for R' and D'.





5.3.2
                 D      $\bar{D}$    totals                  D'      $\bar{D}'$    totals

   R             a      b            n           R'          a'      b'            n'

   $\bar{R}$     c      d            N-n         $\bar{R}'$  c'      d'            N'-n'

   totals        m      N-m          N           totals      m'      N'-m'         N'



5.3.3    where N = a + b + c + d is the number of terms in the
         vocabulary
         and N' = a' + b' + c' + d' is the number of classes in
         the classification.

Match functions of the form M(a, b, c, d, a', b', c', d') will be
considered; this class of match function is consistent with 3.3.1. Not
all the arguments, however, are permitted, for it is possible to expose
an implicit dependence on N and N' by eliminating one of a, b, c, d,
and one of a', b', c', d' by using the expressions for N and N' given
in 5.3.3. This is explicitly excluded by 3.3.6 and 3.3.3. In
document-request matching d and d' are the least informative variables
for the measurement of the similarity between the two term-descriptions
and between the two class-descriptions. Accordingly, only match
functions of the form

5.3.4    l = M(a, b, c, a', b', c')

will be considered.
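The six arguments retained in 5.3.4 are simple set counts. A minimal sketch, assuming term-descriptions and class-descriptions are held as Python sets (the function name `contingency` and the sample data are hypothetical), is:

```python
from typing import Set, Tuple


def contingency(request: Set[str], document: Set[str]) -> Tuple[int, int, int]:
    """Return (a, b, c) of table 5.3.2 for a request and a document.

    a: items in both; b: items in the request only; c: items in the
    document only. The cell d is omitted, as in 5.3.4.
    """
    a = len(request & document)
    b = len(request - document)
    c = len(document - request)
    return a, b, c


R = {"cat", "tree", "lake"}          # term-description of the request
D = {"cat", "fish"}                  # term-description of the document
R_classes = {"C1", "C3", "C4"}       # class-description R'
D_classes = {"C1", "C2"}             # class-description D'

a, b, c = contingency(R, D)
a1, b1, c1 = contingency(R_classes, D_classes)
print((a, b, c), (a1, b1, c1))
```

The same function serves for terms and for classes because both tables of 5.3.2 have the same shape.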
























5.4 Introductory definitions 

The operation "remove" defined in 4.5.2 consists of two actions
which may be separated and considered independently. First, a term
may be taken from a class to which it belongs, and second, the same
term may be placed in another class. The combination of these two
operations is equivalent to the remove operation. These two operations
will be called take and place and are performed by the t-function and
the p-function, respectively. The taking and placing of terms with
respect to the same class has a null effect on the pseudo-classification
and is, therefore, expressly avoided. The t and p functions are defined
as follows:

5.4.1    t(x, y) has the effect $c_{ij} = 0$ on C, where $x = T_i$ and
         $y = C_j$; x is a term and y is a class of the
         classification C.

5.4.2    p(x, z) has the effect $c_{ij} = 1$ on C, where $x = T_i$ and
         $z = C_j$, where $C_j$ is a class of the classification C.

The relationships between the a and r functions and the t and p functions
are:

5.4.3    $a(x \to z) = p(x, z)$

5.4.4    $r(x : y \to z) = t(x, y)\, p(x, z)$

where the evaluation is carried out from left to right.
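The four operations may be sketched as follows. This is a minimal illustration, assuming the classification is a dictionary from class names to sets of terms; the function names mirror the t, p, a and r functions but are otherwise hypothetical, and the refusal of y = z reflects the null-effect case which the text expressly avoids.

```python
from typing import Dict, Set

# Assumed representation: class name -> set of member terms.
Classification = Dict[str, Set[str]]


def take(C: Classification, x: str, y: str) -> None:
    """t(x, y): term x ceases to be a member of class y."""
    C[y].discard(x)


def place(C: Classification, x: str, z: str) -> None:
    """p(x, z): term x becomes a member of class z (created if new)."""
    C.setdefault(z, set()).add(x)


def add(C: Classification, x: str, z: str) -> None:
    """a(x -> z) = p(x, z), as in 5.4.3."""
    place(C, x, z)


def remove(C: Classification, x: str, y: str, z: str) -> None:
    """r(x: y -> z) = t(x, y) p(x, z), evaluated left to right (5.4.4)."""
    if y == z:
        raise ValueError("taking and placing in the same class is avoided")
    take(C, x, y)
    place(C, x, z)


C = {"C1": {"cat", "dog"}, "C2": {"fish"}}
remove(C, "dog", "C1", "C2")   # dog moves from C1 to C2
add(C, "cat", "C3")            # cat is also placed in a new class C3
print(C)
```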

In considering the effect of the t and p functions on the class-
description B of a term-description A of a document or request, two
cases are to be distinguished. Suppose that a term is taken from a
class C which possesses no other terms common to A. The class-description
of A will be altered by the complete removal of the class C from it
and this will have an effect on a', b', c'. When the term is
subsequently placed in another class there will be a further change in a',
b', c'. The combined effect of these two changes is the effect produced
by the remove operation r. Suppose, however, a term is taken from a
class which has another term in common with the term-description A. In
this case there will be no change in the values of a', b', c' although
the subsequent placing of the term in another class may produce a change.
The terms of A and the classes of B must, therefore, be grouped
according to the effect of the t-function on them. Two conditions are
required; one is to test whether a specific term, if taken from the class
y, will cause that class to be omitted from the class-description of
A; the other is to test whether there will be no such change. These
conditions² are, respectively:



5.4.5    $L_1(y, A) = (N(v \mid v \in y \wedge A) = 1)$

5.4.6    $L_2(y, A) = (N(v \mid v \in y \wedge A) \neq 1)$

The value³ of 5.4.5 will be T if the class y and the set A of terms
have only one term in common, namely $y \wedge A$. The value of 5.4.6 is T if
the class y and the set A of terms do not have one common term; there
may be none or several. These two conditions are related by
$L_1(y, A) = \neg L_2(y, A)$ but it is convenient to preserve their separate
identity. Corresponding to these two conditions are two versions of
the t-function. The t-function may be applied to a term x and a class
y which are related to the term-description A in such a way that if x
is taken from y then y will vanish from the class-description B of A.
The function which has this effect will be called the $t_1$-function.
Alternatively, if x is taken from y there may be no such effect on the
class-description B of A and the function responsible for this will be
called the $t_2$-function. These two functions are defined by:

5.4.7    $t_1(x, y) = t(x, y);\ L_1(y, A)$

5.4.8    $t_2(x, y) = t(x, y);\ L_2(y, A)$.

² The notation (x | < condition on x >) means 'the set of all x
which satisfy the < condition on x >'. Conditions are separated by
semicolons. N(x | < condition on x >) is the cardinal of the set so
defined.

³ The values T and F are the values true and false of boolean
conditions.
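The two conditions can be checked mechanically. The following sketch assumes that a class and a term-description are both held as Python sets; the names `L1`, `L2` and the sample data are illustrative only.

```python
from typing import Set


def L1(class_terms: Set[str], A: Set[str]) -> bool:
    """5.4.5: the class and A have exactly one term in common."""
    return len(class_terms & A) == 1


def L2(class_terms: Set[str], A: Set[str]) -> bool:
    """5.4.6: the class and A have none, or several, terms in common."""
    return len(class_terms & A) != 1


A = {"cat", "fish", "tree"}
y1 = {"cat", "dog"}            # one term in common with A
y2 = {"cat", "fish", "bird"}   # several terms in common with A
y3 = {"dog"}                   # no term in common with A

for y in (y1, y2, y3):
    # The two conditions are always complementary.
    print(L1(y, A), L2(y, A))
```

Taking "cat" from y1 would make y1 vanish from the class-description of A (the t1 case), whereas taking "cat" from y2 would not (the t2 case).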



Suppose that X is a subset of A and Y is a subset of B where B
is the class-description of the term-description A with respect to the
classification. Suppose also that x is a term belonging to the class
y. X and Y are defined by:

5.4.9    For any term x belonging to X, there exists a class y
         belonging to Y such that if x is taken from the class y
         then y will vanish from B.

5.4.10   For any class y belonging to Y, there exists a term x
         belonging to X such that if x is taken from y then y vanishes
         from B.

5.4.11   If the class y vanishes from B when x is taken from y, then
         x belongs to X and y belongs to Y.



















X is called the domain in A of the $t_1$-function. Y is called the range
in B of the $t_1$-function.

Suppose that X is the domain in A of the $t_1$-function and Y
is the range in B of the $t_1$-function. Suppose that the states of the
classification before and after the application of $t_1(x, y)$ are C and
C', respectively. Then it follows from the definition of X and Y that
X and Y have the properties:

5.4.9'    $x \in X \Rightarrow \exists y \in Y \cdot y \notin f(A, C')$

5.4.10'   $y \in Y \Rightarrow \exists x \in X \cdot y \notin f(A, C')$

5.4.11'   $y \notin f(A, C') \Rightarrow x \in X,\ y \in Y$

where $X \subseteq A$, $Y \subseteq B$ and $B = f(A, C)$

and

5.4.12    $X = (\forall x)(\forall y)(x \mid x \in A \wedge y;\ y \in B;\ L_1(y, A)) = \mathrm{domain}(t_1; A, B)$

5.4.13    $Y = (\forall y)(y \mid y \in B;\ L_1(y, A)) = \mathrm{range}(t_1; A, B)$.

The range in B and the domain in A of the $t_2$-function are defined
analogously.

⁴ $\forall$ is the universal quantifier. Although (x | < condition on x >),
i.e., 'the set of all x's which satisfy the < condition on x >', uses the
quantifier $\forall$ implicitly, the quantifier may be written explicitly to
display the role of x. Thus, there is no difference in meaning between
(x | < condition on x >) and ($\forall$x)(x | < condition on x >).

The domain in A of $t_1$, the domain in A of $t_2$, the range in B of
$t_1$ and the range in B of $t_2$ provide the grouping of terms of A and
classes of B according to the effects of the t-functions. The effects
of the t-functions may now be stated explicitly:



5.4.14   $t_1(x, y)$ decreases the cardinal of the range of $t_1$ by
         one if and only if x is in the domain in A of
         $t_1$ and y is in the range in B of $t_1$, and has no
         effect otherwise.

5.4.15   $t_2(x, y)$ has no effect on the cardinal of the range in B
         of $t_2$ if x belongs to the domain in A of $t_2$
         and y belongs to the range in B of $t_2$.



The four sets are related in the following way. Let
$Y_1 = \mathrm{range}(t_1; A, B)$ and $Y_2 = \mathrm{range}(t_2; A, B)$. Then,

5.4.16   $Y_1$ and $Y_2$ are disjoint and cover B. That is,
         $Y_1 \wedge Y_2 = \emptyset$, $Y_1 \vee Y_2 = B$
         since
         a.  if $Y_2$ is non-empty, there is no term x belonging to
             A which, when taken from any class in $Y_2$, will cause
             this class to vanish from B,
         and
         b.  if $Y_1$ is non-empty, there is no class in $Y_1$ which will
             not vanish from B if a suitably chosen term is taken
             from it.


5.4.17   $X_1$ and $X_2$ are not disjoint but cover A. That is,
         $X_1 \wedge X_2 \neq \emptyset$ and $X_1 \vee X_2 = A$ where
         $X_1 = \mathrm{domain}(t_1; A, B)$ and $X_2 = \mathrm{domain}(t_2; A, B)$,
         since
         there may exist a term x belonging to $X_1$ which, when taken
         from a class in $Y_1$, causes that class to be removed from B
         but when removed from another class in B does not cause
         that class to vanish. Thus, x may belong to $X_2$.
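The relations 5.4.16 and 5.4.17 can be seen on a small example. The sketch below assumes a dictionary representation of the classification and computes the two ranges and the two domains directly from 5.4.12 and 5.4.13 and their analogues; all names and data are hypothetical.

```python
from typing import Dict, Set, Tuple

Classification = Dict[str, Set[str]]


def class_description(A: Set[str], C: Classification) -> Set[str]:
    """B = f(A, C): the classes with at least one term in common with A."""
    return {name for name, members in C.items() if members & A}


def ranges(A: Set[str], C: Classification) -> Tuple[Set[str], Set[str]]:
    """Return (Y1, Y2): classes of B satisfying L1 and L2 respectively."""
    B = class_description(A, C)
    Y1 = {y for y in B if len(C[y] & A) == 1}
    Y2 = B - Y1
    return Y1, Y2


def domains(A: Set[str], C: Classification) -> Tuple[Set[str], Set[str]]:
    """Return (X1, X2): terms of A lying in some class of Y1, of Y2."""
    Y1, Y2 = ranges(A, C)
    X1 = set().union(*(C[y] & A for y in Y1))
    X2 = set().union(*(C[y] & A for y in Y2))
    return X1, X2


C = {"C1": {"cat", "dog"}, "C2": {"cat", "fish", "bird"}, "C3": {"tree"}}
A = {"cat", "fish", "tree"}

Y1, Y2 = ranges(A, C)
X1, X2 = domains(A, C)
print(sorted(Y1), sorted(Y2), sorted(X1), sorted(X2))
```

Here "cat" lies in both domains: taking it from C1 makes C1 vanish from B, while taking it from C2 leaves C2 in B because "fish" remains. The domains cover A in this example because every term of A belongs to at least one class.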



5.5 Effect of t and p Functions on a Document-
Request Pair

Suppose that the document-request pair (D, R) is considered in
which D is a document and R is a request. D and R are expressed by
term-descriptions. Suppose also that their class-descriptions are D'
and R', respectively, where






5.5.1    D' = f(D, C) and R' = f(R, C)

for a classification C and a function f satisfying 5.2.1. The terms of
R and D and the classes of R' and D' are separated into sets defined
as follows:



5.5.2    $F = R \wedge D$,   $G = D \wedge \neg(R \wedge D)$,   $H = R \wedge \neg(R \wedge D)$

         $F' = R' \wedge D'$,   $G' = D' \wedge \neg(R' \wedge D')$,   $H' = R' \wedge \neg(R' \wedge D')$

Thus, F' contains the classes common to R' and D'. H' contains the
residue of R' after the removal of classes common to D'. G' contains the
residue of D' after the removal of classes common to R'. F', G', H' are
disjoint. Similar statements hold for F, G, H. From 5.3.2:

         |F| = a, |G| = c, |H| = b and |F'| = a',
         |G'| = c', |H'| = b'.

The analysis described above will now be applied to the class-
description R' of R. F' in R' may be partitioned into two sets of classes,
denoted by $F'_1$ and $F'_2$, which are defined as follows:

5.5.3    $F'_1 = \mathrm{range}(t_1; R, F')$
and
         $F'_2 = \mathrm{range}(t_2; R, F')$.

$F'_1$ and $F'_2$ are disjoint and cover F', by 5.4.16. In addition, $F''_1$ and
$F''_2$ are defined as follows:

5.5.4    $F''_1 = \mathrm{domain}(t_1; R, F')$
and
         $F''_2 = \mathrm{domain}(t_2; R, F')$.
Analogous statements hold for an exhaustive and disjoint partition of
H' into $H'_1$ and $H'_2$ with H' replacing F' in 5.5.3 and 5.5.4 above.

Similarly, F' in D' may be partitioned into the disjoint sets $F'_3$ and $F'_4$
which cover F'. These are defined by:

5.5.5    $F'_3 = \mathrm{range}(t_1; D, F')$
and
         $F'_4 = \mathrm{range}(t_2; D, F')$.

In addition, $F''_3$ and $F''_4$ are defined as follows:

5.5.6    $F''_3 = \mathrm{domain}(t_1; D, F')$
and
         $F''_4 = \mathrm{domain}(t_2; D, F')$.

Analogous statements hold for the exhaustive disjoint partition of G'
into $G'_1$ and $G'_2$. With these preliminary remarks, the effect of the
t-function on (D, R) will be examined.

Suppose that x belongs to $F''_1$. Then by 5.4.9', a class y can be
found such that if x is removed from y then the cardinal of F' will
be reduced by one. There is, however, another effect if x also
belongs to $F''_4$. A class will remain in D' which matched with the class
in F' which vanished with the removal of x, and this class will be
unaffected by the removal. It will, therefore, become attached to G',
with the result that the cardinal of G' will be increased by one. The
cardinal of H' will be unaffected. If x does not belong to $F''_4$ but to
$F''_3$ the cardinal of G' will be unaffected.

Suppose that x belongs to $F''_3$. Then by 5.4.9', a class y can
be found such that if x is removed from y then the cardinal of F' will
be reduced by one. In addition, if x belongs to $F''_2$ a class will
remain in R' which matched with the class in F' which disappeared with the
removal of x, and this class will be unaffected by the removal. It will,
therefore, become attached to H' with the result that the cardinal of
H' will be increased by one. The cardinal of G', however, will be
unaffected. If x does not belong to $F''_2$ but to $F''_1$ the cardinal of H'
will be unaffected.

For any term x in $H''_1$ a class y in $H'_1$ can be found such that if x
is taken from y then the cardinal of H' is reduced by one. The cardinals
of G' and F', however, remain unchanged.

For any term x in $G''_1$ a class y in $G'_1$ can be found such that if
x is taken from y then the cardinal of G' is reduced by one. The
cardinals of F' and H' are unaffected.

If a term x in $F''_2$ is taken from a class in $F'_2$, there will be no
effect on the cardinals of F', G', H' if x also belongs to $F''_4$.

Similarly, there is no effect if a term in $H''_2$ is taken from a
class in $H'_2$ or if a term in $G''_2$ is taken from a class in $G'_2$. These
effects on a', b', c' are tabulated in 5.5.8 below.

5.5.7 The increments in a', b', c' due to the t-function are denoted
by $\Delta_t a'$, $\Delta_t b'$, $\Delta_t c'$, respectively.

5.5.8 Table of the effects of t(x, y); $x \in X$ on a', b', c'. The four
domains $F''_1$, $F''_2$, $H''_1$, $H''_2$ (defined in 5.5.4) and $F''_3$, $F''_4$, $G''_1$, $G''_2$
(defined in 5.5.6) are considered.





   X             $\Delta_t a'$    $\Delta_t b'$    $\Delta_t c'$

   $F''_{14}$        -1           0           1
   $F''_{23}$        -1           1           0
   $F''_{31}$        -1           0           0
   $F''_{42}$         0           0           0
   $H''_1$            0          -1           0
   $H''_2$            0           0           0
   $G''_1$            0           0          -1
   $G''_2$            0           0           0



where:

5.5.9    $F''_{ij} = F''_i \wedge F''_j$.
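A single row of the table can be reproduced by direct simulation. The sketch below is an illustration under stated assumptions: the classification is a dictionary of sets, f is taken in the simple form which lists every class sharing a term with the description, and the term and class names are invented. The term x lies in a class y which holds no other term of R but does hold another term of D, so x belongs to the first row of the table.

```python
from typing import Dict, Set, Tuple

Classification = Dict[str, Set[str]]


def class_description(terms: Set[str], C: Classification) -> Set[str]:
    """Classes with at least one term in common with `terms`."""
    return {name for name, members in C.items() if members & terms}


def class_counts(R: Set[str], D: Set[str],
                 C: Classification) -> Tuple[int, int, int]:
    """Return (a', b', c') = (|F'|, |H'|, |G'|) for the pair (D, R)."""
    R1 = class_description(R, C)
    D1 = class_description(D, C)
    return len(R1 & D1), len(R1 - D1), len(D1 - R1)


C = {"y": {"x", "d2"}, "h": {"r2"}}
R = {"x", "r2"}
D = {"x", "d2"}

before = class_counts(R, D, C)
C["y"].discard("x")                # the take operation t(x, y)
after = class_counts(R, D, C)

delta = tuple(new - old for new, old in zip(after, before))
print(before, after, delta)
```

The class y leaves R' but stays in D', so it moves from F' to G'; the increments agree with the first row of 5.5.8.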

The effect of the p-function, defined in 5.4.2, will now be
considered. Suppose that X is selected from one of F, G, H, and that Z
is selected from one of F', G', H', V, W where

5.5.10   $V \in C \wedge \neg(R' \vee D')$,   $W \notin C$

and V and W are sets of terms and where C is the set of classes
which together constitute the classification. For each selection of X
and each selection of Z, p(x, z) has an effect on a', b', c' which
does not vary with x or z provided that x belongs to X and z belongs
to Z. The effect on a', b', c' of the p-function for $z \in W$ and $x \in X$ is
indistinguishable from the effect for $z \in V$ and $x \in X$ whatever the
selection of X from F, G, H. W is retained, however, since it is a
'created' or 'new' class as defined in Section 4.3, whereas V is a class
of the classification which belongs neither to D' nor to R'. It is the
distinction between a class of the classification and a 'new' class which
led to the precedence rules established in Section 4.5. For each of
the three possible choices of X there are five possible choices for Z
and since these may be made independently, the total number of pairs
(X, Z) for which the effects of p(x, z); $x \in X$; $z \in Z$ on a', b', c' are to
be examined is fifteen. It will be seen in 5.5.12 below that some
simplification may be carried out to reduce the total number of pairs
to eleven. These account for all the possible single-step p-functions
which may be applied to (D, R).

Suppose that a term x which belongs to H is placed in a class z
belonging to F' or H'. There will be no change in a', b', c' since
no classes will have been introduced or removed. The effect of such
an operation is to change the distribution of the classes among $F'_1$, $F'_2$,
$H'_1$, $H'_2$.

Suppose that a term x which belongs to H is placed in a class z
belonging to G'. R' will now have an additional class z in common
with D'. Therefore, the number of matching classes will increase by
one. The matching class z will be shifted from G' to F', thereby
decreasing the cardinal of G' by one.

Suppose that a term x which belongs to H is placed in a class z
belonging to V or W. The cardinal of H' will be increased by one and
there will be no change in the cardinals of G' and F'.

Analogous statements hold for a term x belonging to G which is
placed in a class z belonging to F', G', H', V or W.
Suppose that a term x which belongs to F is placed in a class
z which belongs to F'. No new class matches will be made and no new
classes will be added. There will, therefore, be no change in a', b',
c'.

Suppose that a term x which belongs to F is placed in a class z
which belongs to G'. The class z is, therefore, inserted into both D'
and R' and the number of class matches increases by one. G', however,
loses the class z, so the cardinal of G' decreases by one.

Suppose that a term x which belongs to F is placed in a class z
which belongs to H'. The class z is, therefore, inserted into both D'
and R' and the number of class matches increases by one. H', however,
loses the class z, so the cardinal of H' decreases by one.

Suppose that a term x which belongs to F is placed in a class z
which belongs to V or W. A new class match on z will be introduced
without changing H' or G'. The cardinal of F' will, therefore,
increase by one and the cardinals of H' and G' will be unchanged. The
results are tabulated in 5.5.12 below.

5.5.11 The increments in a', b', c' due to the p-function are denoted
by $\Delta_p a'$, $\Delta_p b'$, $\Delta_p c'$, respectively.

5.5.12 Table of the effects of p(x, z); $x \in X$; $z \in Z$ on a', b', c' for
all relevant (X, Z).










Now the effect of p(x, z); $x \in H$; $z \in G'$ on a', b', c' is seen from
5.5.12 to be identical to p(x, z); $x \in F$; $z \in G'$ so these two may be
contracted into the single p-function p(x, z); $x \in R$; $z \in G'$.

Similarly, the effect of p(x, z); $x \in G$; $z \in H'$ on a', b', c' may be seen
from 5.5.12 to be identical to the effect of p(x, z); $x \in F$; $z \in H'$ so
these two may be contracted into the single p-function p(x, z);
$x \in D$; $z \in H'$. This simplification reduces 5.5.12 to:

5.5.13 Simplified table of the effects of p(x, z); $x \in X$; $z \in Z$ on a',
b', c' for all relevant (X, Z).
   X      Z           $\Delta_p a'$    $\Delta_p b'$    $\Delta_p c'$

   H      R'              0           0           0
   R      G'              1           0          -1
   H      V or W          0           1           0
   D      H'              1          -1           0
   G      D'              0           0           0
   G      V or W          0           0           1
   F      F'              0           0           0
   F      V or W          1           0           0
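The second row of the simplified table may be checked by simulation. As before, this is only a sketch: the dictionary representation, the simple form of f, and the names of the terms and classes are assumptions. A term x of the request which is absent from the document is placed in a class z which so far belongs to D' only.

```python
from typing import Dict, Set, Tuple

Classification = Dict[str, Set[str]]


def class_description(terms: Set[str], C: Classification) -> Set[str]:
    """Classes with at least one term in common with `terms`."""
    return {name for name, members in C.items() if members & terms}


def class_counts(R: Set[str], D: Set[str],
                 C: Classification) -> Tuple[int, int, int]:
    """Return (a', b', c') = (|F'|, |H'|, |G'|) for the pair (D, R)."""
    R1 = class_description(R, C)
    D1 = class_description(D, C)
    return len(R1 & D1), len(R1 - D1), len(D1 - R1)


C = {"z": {"d1"}, "h": {"x"}}
R = {"x"}        # x belongs to H, since it is in R but not in D
D = {"d1"}       # z belongs to G', since it is in D' but not in R'

before = class_counts(R, D, C)
C["z"].add("x")                    # the place operation p(x, z)
after = class_counts(R, D, C)

delta = tuple(new - old for new, old in zip(after, before))
print(before, after, delta)
```

The class z moves from G' to F', which gives the increments (1, 0, -1) of the row for $x \in R$; $z \in G'$.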



5.6 Effect of the r and a Functions on a
Document-Request Pair

The effect of the r-function on a', b', c' is denoted by $\Delta_r a'$,
$\Delta_r b'$, $\Delta_r c'$ and the effect of the a-function on a', b', c' is denoted
by $\Delta_a a'$, $\Delta_a b'$, $\Delta_a c'$. Suppose that the effect of t(x, y); $x \in X$; $y \in Y$ on
a', b', c' is $\Delta_t a'$, $\Delta_t b'$, $\Delta_t c'$ for all terms x belonging to X and all
suitable classes y which belong to Y. Suppose also that the effect of
p(x, z); $x \in X'$; $z \in Z$ on a', b', c' is $\Delta_p a'$, $\Delta_p b'$, $\Delta_p c'$. From 5.4.4 it
is known that $r(x : y \to z) = t(x, y)\, p(x, z)$ where the operations are
carried out from left to right. Then the effect of $r(x : y \to z)$;
$x \in X''$; $y \in Y$; $z \in Z''$ is given by:

5.6.1    $\Delta_r a' = \Delta_t a' + \Delta_p a'$

         $\Delta_r b' = \Delta_t b' + \Delta_p b'$

         $\Delta_r c' = \Delta_t c' + \Delta_p c'$

provided that $X'' \subseteq X \wedge X'$ and $Z'' \subseteq Z$.
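The addition in 5.6.1 is easily mechanised. In the sketch below the increments are copied from Tables 5.5.8 and 5.5.13; the string keys used to label the rows, and the function name `r_effect`, are assumptions made purely for illustration.

```python
from typing import Dict, Tuple

Increment = Tuple[int, int, int]   # (delta a', delta b', delta c')

# Two rows of Table 5.5.8 (effects of the t-function).
T_EFFECT: Dict[str, Increment] = {
    "F14": (-1, 0, 1),
    "F23": (-1, 1, 0),
}

# Rows of Table 5.5.13 (effects of the p-function) used with these domains.
P_EFFECT: Dict[Tuple[str, str], Increment] = {
    ("R", "G'"): (1, 0, -1),
    ("D", "H'"): (1, -1, 0),
    ("F", "F'"): (0, 0, 0),
    ("F", "V or W"): (1, 0, 0),
}


def r_effect(t_inc: Increment, p_inc: Increment) -> Increment:
    """5.6.1: the effect of r is the sum of the effects of t and p."""
    return tuple(t + p for t, p in zip(t_inc, p_inc))


for domain, t_inc in T_EFFECT.items():
    for (x_set, z_set), p_inc in P_EFFECT.items():
        print(domain, x_set, z_set, r_effect(t_inc, p_inc))
```

The sums printed for the first domain are (0, 0, 0), (0, -1, 1), (-1, 0, 1) and (0, 0, 1), in that order.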
Also, from 5.4.3 it is known that $a(x \to z) = p(x, z)$. Thus, the
effect of $a(x \to z)$; $x \in X'$; $z \in Z$ is given by:

5.6.2    $\Delta_a a' = \Delta_p a'$

         $\Delta_a b' = \Delta_p b'$

         $\Delta_a c' = \Delta_p c'$

Table 5.6.3 below gives a complete list of the single-step r-functions
which may be applied to the pseudo-classification. The effects on
a', b', c' are deduced by applying 5.6.1 to the table for the
t-function given in 5.5.8 and to the table for the p-function given in
5.5.13. In each case the largest set X'' contained in both X and X'
and the largest set Z'' contained in Z are taken. Table 5.6.4 gives a
complete list of the single-step a-functions which may be applied to
the pseudo-classification. The effects on a', b', c' are given by
applying 5.6.2 to the table for the p-function given in 5.5.13.
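5.6.3 Table of the effects of the r-function on a', b', c'.

   See Table 5.5.8                  See Table 5.5.13
   t(x, y); $x \in X$; $y \in Y$         p(x, z); $x \in X'$; $z \in Z$           $r(x : y \to z)$; $x \in X''$; $y \in Y$; $z \in Z''$

   X          $\Delta_t a'$ $\Delta_t b'$ $\Delta_t c'$     X'   Z     $\Delta_p a'$ $\Delta_p b'$ $\Delta_p c'$     X''        Z''   $\Delta_r a'$ $\Delta_r b'$ $\Delta_r c'$

   $F''_{14}$    -1    0    1        R    G'     1    0   -1        $F''_{14}$    G'     0    0    0
   $F''_{14}$    -1    0    1        D    H'     1   -1    0        $F''_{14}$    H'     0   -1    1
   $F''_{14}$    -1    0    1        F    F'     0    0    0        $F''_{14}$    F'    -1    0    1
   $F''_{14}$    -1    0    1        F    V      1    0    0        $F''_{14}$    V      0    0    1
   $F''_{14}$    -1    0    1        F    W      1    0    0        $F''_{14}$    W      0    0    1
   $F''_{23}$    -1    1    0        R    G'     1    0   -1        $F''_{23}$    G'     0    1   -1
   $F''_{23}$    -1    1    0        D    H'     1   -1    0        $F''_{23}$    H'     0    0    0
   $F''_{23}$    -1    1    0        F    F'     0    0    0        $F''_{23}$    F'    -1    1    0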
5.6.4 Table of the effects of the a-function on a', b', c'.

   See Table 5.5.13
   $a(x \to z)$; $x \in X$; $z \in Z$
The complete algebraic formulation of the perturbations
contained in 5.6.3 is carried out as follows. Consider the first
perturbation, namely, $r(x : y \to z)$; $x \in F''_{14}$; $y \in Y$; $z \in G'$.

From 5.5.9       $F''_{14} = F''_1 \wedge F''_4$

from 5.5.4, 6             $= \mathrm{domain}(t_1; R, F') \wedge \mathrm{domain}(t_2; D, F')$

from 5.4.12               $= (\forall x)(\forall y)(x \mid x \in R \wedge y;\ x \in D \wedge y;\ y \in F';\ L_1(y, R);\ L_2(y, D))$.

This provides a description of the terms x and classes y which are
allowed as arguments of the perturbation. The complete description is,
therefore,



5.6.5    $r(x : y \to z)$; $x \in F \wedge y$; $y \in F'$; $L_1(y, R)$; $L_2(y, D)$; $z \in G'$

where from 5.5.2

         $G' = D' \wedge \neg(R' \wedge D')$
         $F' = R' \wedge D'$

and from 5.5.1

         D' = f(D, C)
         R' = f(R, C).

The perturbation is applied to the single triad (x, y, z) which satisfies
the condition $x \in F \wedge y$; $y \in F'$; $L_1(y, R)$; $L_2(y, D)$; $z \in G'$. Any (x, y, z)
which satisfies this condition will have the effect $\Delta_r a' = 0$,
$\Delta_r b' = 0$, $\Delta_r c' = 0$ on a', b', c' as set out in 5.6.3. The choice of
(x, y, z) from among the triads which satisfy the condition is made
using 4.5.3 and 4.5.4.



5.7 Classification of Perturbations

The perturbations of 5.6.3 are r-type except those for which
Z'' = W (i.e., z is a new class), which are rn-type. The perturbations
of 5.6.4 are c-type except those for which Z'' = W, which are cn-type.
The types are defined in Section 4.5. There remains, however, the
additional grouping into the classes increasing, level, decreasing
which are defined in 4.3.3. Consider the general match function of the
retrieval model.



From 5.3.4    l = M(a, b, c, a', b', c').

Suppose that $\Delta_r l$ is the increment in l after the application of an
r-function. Then,

5.7.1    $\Delta_r l = M(a, b, c, a' + \Delta_r a', b' + \Delta_r b', c' + \Delta_r c') - M(a, b, c, a', b', c')$

so that if

5.7.2    $\Delta_r l > 0$   then the r-function is increasing
         $\Delta_r l = 0$   then the r-function is level
         $\Delta_r l < 0$   then the r-function is decreasing.
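The grouping of 5.7.2 depends only on the sign of the increment in 5.7.1. A minimal sketch follows; the function `classify` and the match function `M_example` are hypothetical, the latter being chosen only so that the arithmetic is easy to follow and not being a match function proposed in this report.

```python
from typing import Callable, Tuple

MatchFunction = Callable[[int, int, int, int, int, int], float]


def classify(M: MatchFunction,
             args: Tuple[int, int, int, int, int, int],
             delta: Tuple[int, int, int]) -> str:
    """Group an r-function by the sign of its increment in l (5.7.1, 5.7.2)."""
    a, b, c, a1, b1, c1 = args
    da, db, dc = delta
    change = M(a, b, c, a1 + da, b1 + db, c1 + dc) - M(a, b, c, a1, b1, c1)
    if change > 0:
        return "increasing"
    if change < 0:
        return "decreasing"
    return "level"


def M_example(a: int, b: int, c: int, a1: int, b1: int, c1: int) -> float:
    """Hypothetical match function used only for this illustration."""
    return a + a1 - b1


state = (2, 1, 1, 3, 2, 1)   # (a, b, c, a', b', c')
print(classify(M_example, state, (-1, 1, 0)))
print(classify(M_example, state, (0, 0, 1)))
print(classify(M_example, state, (1, 0, -1)))
```

Because the term counts a, b, c are held fixed, only the class increments influence the grouping.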






SECTION VI 

6. ENUMERATION OF PERTURBATIONS FOR A 
GIVEN MATCH FUNCTION 

6.1 Description of the Match Function 

The enumeration of perturbations given in Section 5 was for a
general match function which satisfied certain conditions. A match
function will now be defined which is a particular case of this general
match function. A complete enumeration of the single-step perturbations
which affect it will be given. The match function has not been tested
in retrieval and has been constructed only with the intention of
demonstrating the techniques of Section 5. It is shown, however, that
this particular choice of match function obeys the conditions which
have been placed upon match functions in Section 3.3.

Suppose that D is a document and R is a request, where R and D
are both defined by lists of terms. C is the present state of the
classification. The match coefficient will be designed to give the
extent to which R is included in D, both for terms and for classes.
The class-descriptions of R and D are R' and D', respectively, and are
defined by:

         R' = f(R, C) and D' = f(D, C).

The purpose of f is to provide a translation of a set of terms X into
a set of classes according to the membership of the terms of X to
classes of the classification. The most natural way of doing this is
to list all the classes of C which contain at least one term in common
with X. f may, therefore, be defined by,

6.1.1    $f(X, C) = (C_j \mid c_{ij} = 1;\ T_i \in X)$.

Terms common both to the document and to the request, however, will
necessarily lead to common classes in their class-descriptions. These
classes are those which contain at least one of the common terms. If
these common terms are removed, the class-descriptions of R and D will
have no necessarily included classes. Another definition of f is,
therefore,

6.1.2    $f(X, C) = (C_j \mid c_{ij} = 1;\ T_i \in X \wedge \neg(R \wedge D))$

where the $T_i$ are terms of the vocabulary and the $C_j$ are classes of the
classification C. Both of these definitions of f satisfy 5.2.1.

In the comparison of a request and a document it is the included
terms and classes which contribute positively to the match coefficient
by the Relevance Hypothesis (1.1.2). Suppose that t is the contribution
to the match coefficient l from the term matches and that c is the
contribution from class matches. Suppose further that l is defined by:

6.1.3    l = pt + qc

where p and q are positive constants. That these constants are positive
is a consequence of the Relevance Hypothesis. The contributions t and
c to l may be designed to give additional emphasis to terms and classes
belonging to R but not to D by counting these against the match













coefficient. For terms, this is a slightly ill-advised procedure for
it places undue reliance upon standard vocabulary use. It is a
problem more properly dealt with using the term classification. A simple
definition of t is,

6.1.4    $t = N(R \wedge D)/N(R)$.

For classes, however, the position is slightly different. The classes
of the classification C are intended to represent concepts which are
apposite to the document collection. If a document is relevant to a
request, it is, therefore, expected that the class-descriptions
conform to a higher degree than the term descriptions. It may be argued,
therefore, that the absence of a complete concept from a document
detracts more from the relevance of the document to the request than
an absent term. A plausible expression for c which will serve for this
example is,

6.1.5    $c = (N(R' \wedge D')/N(R')) - (N(R' \wedge \neg(R' \wedge D'))/N(R'))$.

Thus, if there are proportionately more classes of the request
contained in the document than not, then the contribution to the match
coefficient is positive. Otherwise, it is negative.

The role of p and q can now be interpreted in terms of the class
and term matches. The values of p and q may be used to increase the
importance of class matches compared with term matches, or vice versa.
In this example it will be assumed that in the evaluation of the match
coefficient a class match is equivalent to a term match. Accordingly,
in the notation of 5.3.2, 6.1.4 and 6.1.5 become:

6.1.6    t = a/n

6.1.7    c = (a' - b')/n'

whence from 6.1.3:

6.1.8    l = a/n + (a' - b')/n'.

This, together with 6.1.2, is to be regarded as the match function M
to be used throughout this example.
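The match coefficient of 6.1.8 may be evaluated as in the following sketch. The function name `match_coefficient` and the sample sets are assumptions; exact fractions are used so that the extreme values can be compared without rounding, and both the request and its class-description are assumed to be non-empty.

```python
from fractions import Fraction
from typing import Set


def match_coefficient(R: Set[str], D: Set[str],
                      R1: Set[str], D1: Set[str],
                      p: int = 1, q: int = 1) -> Fraction:
    """l = p*t + q*c with t = a/n (6.1.6) and c = (a' - b')/n' (6.1.7).

    R, D are term-descriptions; R1, D1 the class-descriptions R', D'.
    """
    t = Fraction(len(R & D), len(R))
    c = Fraction(len(R1 & D1) - len(R1 - D1), len(R1))
    return p * t + q * c


R = {"cat", "tree"}
R1 = {"C1", "C3"}

# Request entirely contained in the document, for terms and for classes.
contained = match_coefficient(R, {"cat", "tree", "fish"}, R1, {"C1", "C2", "C3"})
# Request and document mutually disjoint, for terms and for classes.
disjoint = match_coefficient(R, {"fish"}, R1, {"C2"})
# An intermediate case: one of two terms and one of two classes shared.
partial = match_coefficient(R, {"cat"}, R1, {"C1"})
print(contained, disjoint, partial)
```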

It will now be shown that this satisfies the conditions which
have been imposed upon the behaviour of match functions. 6.1.8 satisfies
3.3.2 since it is a monotonically increasing function of a and a' for
a specific document and a specific request. It also satisfies 3.3.3
because it is independent of N', 3.3.4 trivially and 3.3.6 since it
does not depend on N. Provided that the request has at least one term,
l is finite. In addition, since $0 \le t \le 1$ and $-1 \le c \le 1$, the match
coefficient is bounded above and below. It reaches its lower
bound if the term and class descriptions of the request and
document are mutually disjoint; and reaches its upper bound if the
term and class descriptions of the request are entirely contained in
those of the document. 6.1.8, therefore, satisfies 3.3.4, and,
therefore, all the conditions of Section 3.3. The upper and lower bounds
of l are given, respectively, by,

         $l^{+} = 2$ and $l^{-} = -1$.









The critical value of the match coefficient is, therefore, given, from
3.3.8, by:

         $l_0 = \tfrac{1}{2}$.

It is important to recollect that none of the changes caused by
perturbations may alter the term descriptions of R and D. Therefore, for a
particular document and request, t will remain constant. All
variability in l derives from c. From 6.1.7:

6.1.9    c = 1 - 2b'/n'

since a' + b' = n'. Therefore, it may be seen that the match function
6.1.8, which involves decreasing the match coefficient with the number
of classes in the request but not in the document, may be regarded as
equivalent to a match function in which the proportion of shared terms
and shared classes is measured, and in which matches on classes count
twice as much as matches on terms. This equivalent match function
has p = 1 and q = 2, the constant 1 being ignored.
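The equivalence may be written out step by step. The following lines are a worked restatement of 6.1.8 and 6.1.9 using only the relation a' + b' = n'; no new quantity is introduced.

```latex
\begin{align*}
l &= \frac{a}{n} + \frac{a' - b'}{n'}
   && \text{(6.1.8)} \\
  &= \frac{a}{n} + 1 - \frac{2b'}{n'}
   && \text{(6.1.9, since } a' = n' - b'\text{)} \\
  &= \frac{a}{n} + \frac{2a'}{n'} - 1
   && \text{(since } b' = n' - a'\text{)}
\end{align*}
```

The last line has the form pt + qc with p = 1, q = 2, t = a/n and c = a'/n', apart from an additive constant which does not affect the ordering of documents.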

6.2 Enumeration of Perturbations

Let $\Delta_r l$ be the change in the match coefficient due to an
r-function. Then from 6.1.3:

   $\Delta_r l = \Delta_r t + \Delta_r c$

        $= \Delta_r c$    ($\Delta_r t = 0$ since the term descriptions of D and R
                    remain unchanged)

from 6.1.9

        $= -2 \left( \dfrac{b' + \Delta_r b'}{n' + \Delta_r n'} - \dfrac{b'}{n'} \right)$

        $= -2 \left( \dfrac{n' \Delta_r b' - b' \Delta_r n'}{n'(n' + \Delta_r n')} \right)$

        $= -2\, g(\Delta_r b', \Delta_r n', b', n')$, say,



where $n' = a' + b'$ and $\Delta_r n' = \Delta_r a' + \Delta_r b'$. The values of $\Delta_r b'$ and $\Delta_r n'$
may be determined for each r-function from 5.6.3 and for each
a-function from 5.6.4. The pairs of values of these increments which
occur for the present match function are (1, 1), (1, 0), (0, -1), (0, 0),
(-1, -1), (-1, 0), (0, 1). The increases in l corresponding to these
changes are given in 6.2.1 below, together with the grouping of the
perturbations which produce these changes into increasing, level, and
decreasing.



6.2.1

   $g(1, 1, b', n') = \dfrac{n' - b'}{n'(n' + 1)} > 0$ since $n' > b'$.      Decreasing

   $g(1, 0, b', n') = \dfrac{1}{n'} > 0$                                   Decreasing

   $g(0, -1, b', n') = \dfrac{b'}{n'(n' - 1)} > 0$                          Decreasing

   $g(0, 0, b', n') = 0$                                                Level

   $g(-1, -1, b', n') = \dfrac{-n' + b'}{n'(n' - 1)} < 0$ since $n' > b'$.   Increasing

   $g(-1, 0, b', n') = -\dfrac{1}{n'} < 0$                                 Increasing

   $g(0, 1, b', n') = \dfrac{-b'}{n'(n' + 1)} < 0$                          Increasing
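The signs listed in 6.2.1 may be verified numerically. The sketch below evaluates g for one arbitrarily chosen state, b' = 2 and n' = 5, which is an assumption made for the illustration; the function names are hypothetical and exact fractions are used throughout.

```python
from fractions import Fraction


def g(db: int, dn: int, b: int, n: int) -> Fraction:
    """g = (n'*db' - b'*dn') / (n' * (n' + dn')), from the derivation in 6.2."""
    return Fraction(n * db - b * dn, n * (n + dn))


def grouping(db: int, dn: int, b: int, n: int) -> str:
    """The change in l is -2*g, so the grouping has the opposite sign to g."""
    change = -2 * g(db, dn, b, n)
    if change > 0:
        return "increasing"
    if change < 0:
        return "decreasing"
    return "level"


b, n = 2, 5   # b' classes of the request absent from the document, n' in all
pairs = [(1, 1), (1, 0), (0, -1), (0, 0), (-1, -1), (-1, 0), (0, 1)]
for db, dn in pairs:
    print((db, dn), g(db, dn, b, n), grouping(db, dn, b, n))

# Direct check against 6.1.9 for the pair (1, 1): c falls from 1/5 to 0.
c_before = 1 - Fraction(2 * b, n)
c_after = 1 - Fraction(2 * (b + 1), n + 1)
```

For this state the first three pairs are decreasing, (0, 0) is level, and the last three are increasing, in agreement with 6.2.1.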



For a particular match function, a number of perturbations 
which, although distinct according to the general theory of Section 5, 
iray be expressible as a single perturbation with suitably adjusted 
conditions on the argianents. The conflation of perturbations in this 
way is called simplification. Suppose that u^^, . . •> where 
k < 3 is a subset of a* , b* , c* . Then from 5»7»1: 

Apl a M(a,b,c,UQ^ + ^'^k^ 

“ M(a,b, c ,Uq^, • • • ,Uj^) • 



Suppose that r(x;y -> z)j x€X^} yeY^; z€Z^ and r(x;y z)j xeX^j 

yeY 5 z€Z are two perturbations and suppose that their effects on 

u-.,...,u, are A u! and A u** , respectively. Then the two perturbations 
•L r i r X 

are said to be equivalent if; 



6.2.2 A u* a A u" for all i < k . 
r 1 r X 

Thus , if two perturbations are equivalent, then they change the match 
coefficient by the sane amount. Suppose that the two perturbations are 
equivalent and suppose further that and Xg are identical sets and 
that and Y are identical sets* Then the two perturbations may he 
sii®lified to r(x:y-* z); xeX^^j yeY^^; zeZ^^Zg provided that 2 .^, Z^ 
belong to R’ or D' . A distinction is maintained between 'new classes’, 







r 



/ 



I 
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classes belonging to R' or D' , and classes belonging to the classifi- 
cation but not to R' or D* (see 5*5*10)* Suppose on the other hand 

that $Z_1$ and $Z_2$ are the same set. Then, providing certain conditions
are satisfied by the sets $X$ and $Y$ involved, the two perturbations may be
simplified to a single perturbation $r(x{:}y \to z)$. A general discussion of
simplification is not embarked upon here since it will be seen from
6.2.3 that only two simplifications may be carried out. Analogous
statements hold for the a-function.
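The first kind of simplification described above can be sketched as follows. The record type, the representation of X, Y and Z as sets of labels, and the `z_allowed` predicate (standing in for the proviso that Z_1 and Z_2 belong to R' or D') are all hypothetical choices made for illustration; they are not part of the theory of Section 5.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class Perturbation:
    X: FrozenSet[str]         # admissible values of x
    Y: FrozenSet[str]         # admissible values of y
    Z: FrozenSet[str]         # admissible values of z
    effects: Tuple[int, ...]  # changes to u_1, ..., u_k

def simplify(p: Perturbation, q: Perturbation,
             z_allowed: Callable[[FrozenSet[str]], bool]) -> Optional[Perturbation]:
    """Merge two equivalent perturbations with identical X and Y into one.

    Returns None when the conditions for this simplification do not hold.
    """
    if p.effects != q.effects:        # not equivalent in the sense of 6.2.2
        return None
    if p.X != q.X or p.Y != q.Y:      # X and Y must be identical sets
        return None
    if not (z_allowed(p.Z) and z_allowed(q.Z)):
        return None
    return Perturbation(p.X, p.Y, p.Z | q.Z, p.effects)

if __name__ == "__main__":
    p = Perturbation(frozenset({"x1"}), frozenset({"y1"}), frozenset({"z1"}), (1, 1))
    q = Perturbation(frozenset({"x1"}), frozenset({"y1"}), frozenset({"z2"}), (1, 1))
    print(simplify(p, q, z_allowed=lambda z: True))
```

The merged perturbation keeps the common conditions on x and y and admits any z drawn from the union of the two original sets.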

In 6.2.3 and 6.2.4 the perturbations for the match function
defined in 6.1.8 are given together with their type (c, r, cn, rn)
and their effect on the match coefficient (increasing (+), level (0),
decreasing (-)). In 6.2.3 two simplifications have been carried out.

One Involves perturbations for which X" is either or and the 

other involves perturbations for which X" is either G” or G^. The 
complete list of perturbations, both r-functions and a-functions, is
written out in full in 6.2.5. The perturbations are grouped according
to increasing, level, and decreasing, and each of these groups
is further divided into c, r, cn, rn type perturbations.
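The two-level grouping just described, first by effect and then by type, can be sketched as a nested dictionary. The sample records are the a-functions that are legible in 6.2.5; the record layout and the function name are assumptions made for illustration.

```python
EFFECT_ORDER = ("increasing", "level", "decreasing")
TYPE_ORDER = ("c", "r", "cn", "rn")

def classify(perturbations):
    """Group (description, type, effect) records by effect, then by type."""
    groups = {e: {t: [] for t in TYPE_ORDER} for e in EFFECT_ORDER}
    for description, ptype, effect in perturbations:
        groups[effect][ptype].append(description)
    return groups

# a-functions taken from the classified list in 6.2.5.
SAMPLE = [
    ("a(x -> z); x in H; z in V", "c", "decreasing"),
    ("a(x -> z); x in H; z in W", "cn", "decreasing"),
    ("a(x -> z); x in R; z in G'", "c", "increasing"),
    ("a(x -> z); x in D; z in H'", "c", "increasing"),
    ("a(x -> z); x in F; z in V", "c", "increasing"),
    ("a(x -> z); x in F; z in W", "cn", "increasing"),
]

if __name__ == "__main__":
    for effect, by_type in classify(SAMPLE).items():
        for ptype, items in by_type.items():
            for item in items:
                print(effect, ptype, item)
```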















6.2.3 Table of r-functions for l = (a/n) + (a' - b')/n'

r(x:y -> z); x in X"; y in Y; z in Z"

X"      Z"      Type    Δb'     Δn'     Effect on l

F"      G'      r        0       0       0
"       H'      r       -1      -1       +
"       F'      r        0      -1       -
"       V       r        0       0       0
"       W       rn       0       0       0

[?]     G'      r        1       1       -
"       H'      r        0       0       0
"       F'      r        1       0       -
"       V       r        1       1       -
"       W       rn       1       1       -

[?]     G'      r        0       1       +
"       H'      r       -1       0       +
"       F'      r        0       0       0
"       [?]     [?]     [?]     [?]     [?]
"       W       rn       0      [?]     [?]

[?]     R'      r        1       1       -
"       G'      r       -1       0       +
"       V       r        0       0       0
"       W       rn       0       0       0

[?]     R'      r        0       0       0
"       G'      r        0       1       +
"       V       r        1       1       -
"       W       rn       1       1       -

D       H'      r       -1      [?]     [?]
"       [?]     r        0       0      [?]
"       [?]     [?]     [?]     [?]     [?]
"       [?]     [?]      0       0       0


6.2.4 Table of a-functions for l = (a/n) + (a' - b')/n'

a(x -> z); x in X; z in Z

X       Z       Type    Effect on l

6.2.5 Classified list of perturbations for $l = (a/n) + (a' - b')/n'$

Decreasing

c-type

    $a(x \to z);\ x \in H;\ z \in V$

r-type

    $r(x{:}y \to z);\ x \in R \cap y;\ y \in F';\ L_1(y,R);\ z \in F'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ L_1(y,D);\ z \in G'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ L_1(y,D);\ z \in F'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ L_1(y,D);\ z \in V$
    $r(x{:}y \to z);\ x \in R \cap y;\ y \in H';\ L_1(y,R);\ z \in R'$
    $r(x{:}y \to z);\ x \in R \cap y;\ y \in H';\ L_2(y,R);\ z \in V$

cn-type

    $a(x \to z);\ x \in H;\ z \in W$

rn-type

    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ L_1(y,D);\ z \in W$
    $r(x{:}y \to z);\ x \in R \cap y;\ y \in H';\ L_{?}(y,R);\ z \in W$


Increasing

c-type

    $a(x \to z);\ x \in R;\ z \in G'$
    $a(x \to z);\ x \in D;\ z \in H'$
    $a(x \to z);\ x \in F;\ z \in V$

r-type

    $r(x{:}y \to z);\ x \in R \cap y;\ y \in F';\ L_1(y,R);\ z \in H'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_{?}(y,R);\ L_{?}(y,D);\ z \in G'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ L_2(y,D);\ z \in H'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_2(y,R);\ z \in V$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in H';\ L_1(y,R);\ z \in G'$
    $r(x{:}y \to z);\ x \in F \cap y;\ y \in H';\ L_2(y,R);\ z \in G'$
    $r(x{:}y \to z);\ x \in D \cap y;\ y \in G';\ z \in H'$

cn-type

    $a(x \to z);\ x \in F;\ z \in W$

rn-type

    $r(x{:}y \to z);\ x \in F \cap y;\ y \in F';\ L_{?}(y,R);\ L_{?}(y,D);\ z \in W$


SECTION VII



7. IMPLEMENTATION

The most suitable programming language for implementing the
perturbations developed in this paper seems to be provided by the STDS
System (Set-Theoretic Data Structures) of Childs (U).